research-article
Open access

ECOTLB: Eventually Consistent TLBs

Published: 30 September 2020

    Abstract

    We propose ecoTLB, a software-based eventual translation lookaside buffer (TLB) coherence scheme that eliminates the overhead of the synchronous TLB shootdown mechanism in operating systems that use address space identifiers (ASIDs). With eventual TLB coherence, ecoTLB improves the performance of free and page swap operations by removing the inter-processor interrupt (IPI) overheads incurred to invalidate TLB entries. We show that the TLB shootdown has implications for page swapping, particularly in emerging disaggregated data centers, and demonstrate that ecoTLB’s asynchronous mechanism can improve both performance and the specific swapping policy decisions. We demonstrate that ecoTLB improves the performance of real-world applications, such as Memcached and Make, that perform page swapping using Infiniswap, a solution for next-generation data centers that use disaggregated memory, by up to 17.2%. Moreover, ecoTLB improves the 99th percentile tail latency of Memcached by up to 70.8% due to its asynchronous scheme and improved policy decisions. Furthermore, we show that recent security features in the Linux kernel, like kernel page table isolation (KPTI), can result in significant performance overheads on architectures that lack instructions to clear single entries in tagged TLBs and therefore fall back to full TLB flushes. In this scenario, ecoTLB recovers the performance lost to supporting KPTI, owing to its asynchronous shootdown scheme and its support for tagged TLBs. Finally, we demonstrate that ecoTLB improves the performance of free operations by up to 59.1% on a 120-core machine and improves the performance of Apache on a 16-core machine by up to 13.7% compared to baseline Linux, and by up to 48.2% compared to ABIS, a recent state-of-the-art research prototype that reduces the number of IPIs.
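
The contrast the abstract draws between a synchronous shootdown and eventual coherence can be illustrated with a toy model. The sketch below is not ecoTLB's implementation, and nothing in it is taken from the paper beyond the general idea. The names `Core`, `Machine`, `unmap_sync`, `unmap_eventual`, and `tick` are hypothetical. It assumes that each remote core applies queued invalidations at some later local event (modelled here as a scheduler tick) and that a freed page is reclaimed only once every core has done so.

```python
from collections import deque


class Core:
    """A toy core with a private TLB and a queue of pending invalidations."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.tlb = {}            # virtual page -> physical frame
        self.pending = deque()   # virtual pages awaiting local invalidation

    def drain(self):
        """Apply all queued invalidations locally; return the pages drained."""
        drained = list(self.pending)
        for vpage in drained:
            self.tlb.pop(vpage, None)
        self.pending.clear()
        return drained


class Machine:
    """Hypothetical multi-core machine used only to compare the two schemes."""

    def __init__(self, n_cores):
        self.cores = [Core(i) for i in range(n_cores)]
        self.ipis_sent = 0
        self.waiting = {}    # vpage -> ids of cores that have not invalidated yet
        self.reclaimed = []  # vpages whose frames are safe to reuse

    def map_everywhere(self, vpage, frame):
        for core in self.cores:
            core.tlb[vpage] = frame

    def unmap_sync(self, initiator, vpage):
        """Synchronous shootdown: interrupt every other core and wait for it."""
        for core in self.cores:
            if core.core_id != initiator:
                self.ipis_sent += 1  # models one IPI plus the wait for its ack
            core.tlb.pop(vpage, None)
        self.reclaimed.append(vpage)

    def unmap_eventual(self, initiator, vpage):
        """Deferred invalidation: queue work for remote cores, send no IPI."""
        self.cores[initiator].tlb.pop(vpage, None)
        remote = {c.core_id for c in self.cores if c.core_id != initiator}
        for cid in remote:
            self.cores[cid].pending.append(vpage)
        if remote:
            self.waiting[vpage] = remote
        else:
            self.reclaimed.append(vpage)

    def tick(self, core_id):
        """A local event (assumed: a scheduler tick) at which a core catches up."""
        for vpage in self.cores[core_id].drain():
            left = self.waiting.get(vpage)
            if left is None:
                continue  # already reclaimed via an earlier queue entry
            left.discard(core_id)
            if not left:
                del self.waiting[vpage]
                self.reclaimed.append(vpage)
```

In this model, `unmap_sync` on a 4-core machine counts three IPIs and reclaims the page immediately, whereas `unmap_eventual` counts none but leaves stale entries on the remote cores until each has called `tick`. That window of staleness, and the delayed reclamation it forces, is the trade-off any eventually consistent scheme has to manage.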

    Cited By

    • (2023) CryptoMMU: Enabling Scalable and Secure Access Control of Third-Party Accelerators. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 10.1145/3613424.3614311 (32-48). Online publication date: 28-Oct-2023
    • (2022) A Survey on Advancements of Real-Time Analytics Architecture Components. Computational Methods and Data Engineering. 10.1007/978-981-19-3015-7_41 (547-559). Online publication date: 9-Sep-2022
    • (2021) Rebooting virtual memory with midgard. Proceedings of the 48th Annual International Symposium on Computer Architecture. 10.1109/ISCA52012.2021.00047 (512-525). Online publication date: 14-Jun-2021

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 4
    December 2020
    430 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3427420
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 September 2020
    Accepted: 01 July 2020
    Revised: 01 June 2020
    Received: 01 November 2019
    Published in TACO Volume 17, Issue 4

    Author Tags

    1. TLB
    2. asynchrony
    3. translation coherence

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • ETRI
    • NSF
