research-article

Open access

ECOTLB: Eventually Consistent TLBs

Authors:

Mohan Kumar Kumar,

Tushar Krishna,

Abhishek Bhattacharjee

ACM Transactions on Architecture and Code Optimization (TACO), Volume 17, Issue 4

Article No.: 27, Pages 1 - 24

https://doi.org/10.1145/3409454

Published: 30 September 2020

Abstract

We propose ecoTLB, a software-based eventual translation lookaside buffer (TLB) coherence scheme that eliminates the overhead of the synchronous TLB shootdown mechanism in operating systems that use address space identifiers (ASIDs). With eventual TLB coherence, ecoTLB improves the performance of free and page swap operations by removing the inter-processor interrupt (IPI) overheads incurred to invalidate TLB entries. We show that TLB shootdowns have implications for page swapping, particularly in emerging disaggregated data centers, and demonstrate that ecoTLB’s asynchronous mechanism can improve both performance and the swapping policy decisions themselves. ecoTLB improves the performance of real-world applications such as Memcached and Make by up to 17.2% when they perform page swapping using Infiniswap, a solution for next-generation data centers that use disaggregated memory. Moreover, ecoTLB improves the 99th percentile tail latency of Memcached by up to 70.8% due to its asynchronous scheme and improved policy decisions. Furthermore, we show that recent security features in the Linux kernel, such as kernel page table isolation (KPTI), can incur significant performance overheads on architectures that lack instructions to clear single entries in tagged TLBs and therefore fall back to full TLB flushes. In this scenario, ecoTLB recovers the performance lost to KPTI through its asynchronous shootdown scheme and its support for tagged TLBs. Finally, we demonstrate that ecoTLB improves the performance of free operations by up to 59.1% on a 120-core machine and improves the performance of Apache on a 16-core machine by up to 13.7% compared to baseline Linux, and by up to 48.2% compared to ABIS, a recent state-of-the-art research prototype that reduces the number of IPIs.
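
The contrast the abstract draws between a synchronous shootdown and eventual coherence can be illustrated with a toy model. The sketch below is not ecoTLB's implementation: the abstract does not describe its data structures or the points at which cores apply invalidations, so the names `Core`, `sync_shootdown`, `lazy_shootdown`, `drain`, and `safe_to_reuse` are all hypothetical, and the per-core pending queue is an assumed mechanism chosen only to show the general idea of deferring remote invalidations instead of sending IPIs.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy core with an ASID-tagged TLB and a queue of deferred invalidations."""
    core_id: int
    tlb: dict = field(default_factory=dict)        # (asid, vpn) -> pfn
    pending: deque = field(default_factory=deque)  # invalidations not yet applied

    def drain(self) -> int:
        """Apply queued invalidations locally (assumed to run at some later,
        convenient point such as a timer tick or context switch)."""
        applied = 0
        while self.pending:
            self.tlb.pop(self.pending.popleft(), None)
            applied += 1
        return applied


def sync_shootdown(cores, initiator, key) -> int:
    """Baseline model: every remote core is interrupted and invalidates
    the entry before the initiator continues. Returns the IPI count."""
    ipis = 0
    for core in cores:
        if core is not initiator:
            ipis += 1  # models one inter-processor interrupt
        core.tlb.pop(key, None)
    return ipis


def lazy_shootdown(cores, initiator, key) -> int:
    """Eventual model: invalidate locally, queue the invalidation for the
    remote cores, and send no IPIs. Returns the IPI count (always 0)."""
    initiator.tlb.pop(key, None)
    for core in cores:
        if core is not initiator:
            core.pending.append(key)
    return 0


def safe_to_reuse(cores, key) -> bool:
    """A freed page may be reused only once no core still caches the entry."""
    return all(key not in core.tlb for core in cores)
```

In this model the cost moves from the initiator's critical path to a later drain on each remote core, which is why the page behind a lazily invalidated entry cannot be reused until `safe_to_reuse` holds.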

Cited By

Alam F, Lee H, Bhattacharjee A, Awad A. (2023). CryptoMMU: Enabling Scalable and Secure Access Control of Third-Party Accelerators. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 32-48. DOI: 10.1145/3613424.3614311. Online publication date: 28-Oct-2023.
https://dl.acm.org/doi/10.1145/3613424.3614311
Dashora R, Babu M. (2022). A Survey on Advancements of Real-Time Analytics Architecture Components. Computational Methods and Data Engineering, 547-559. DOI: 10.1007/978-981-19-3015-7_41. Online publication date: 9-Sep-2022.
https://doi.org/10.1007/978-981-19-3015-7_41
Gupta S, Bhattacharyya A, Oh Y, Bhattacharjee A, Falsafi B, Payer M, Martínez J, Duato J, John L. (2021). Rebooting virtual memory with midgard. Proceedings of the 48th Annual International Symposium on Computer Architecture, 512-525. DOI: 10.1109/ISCA52012.2021.00047. Online publication date: 14-Jun-2021.
https://dl.acm.org/doi/10.1109/ISCA52012.2021.00047

Index Terms

ECOTLB: Eventually Consistent TLBs
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Virtual memory

Information

Published In

ACM Transactions on Architecture and Code Optimization Volume 17, Issue 4

December 2020

430 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3427420

Editor:
David Kaeli
Northeastern University, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 September 2020

Accepted: 01 July 2020

Revised: 01 June 2020

Received: 01 November 2019

Published in TACO Volume 17, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

ETRI
NSF
