Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3317550.3321424acmconferencesArticle/Chapter ViewAbstractPublication PageshotosConference Proceedingsconference-collections
research-article
Open access

Project PBerry: FPGA Acceleration for Remote Memory

Published: 13 May 2019 Publication History

Abstract

Recent research efforts propose remote memory systems that pool memory from multiple hosts. These systems rely on the virtual memory subsystem to track application memory accesses and transparently offer remote memory to applications. We outline several limitations of this approach, such as page fault overheads and dirty data amplification. Instead, we argue for a fundamentally different approach: leverage the local host's cache coherence traffic to track application memory accesses at cache line granularity. Our approach uses emerging cache-coherent FPGAs to expose cache coherence events to the operating system. This approach not only accelerates remote memory systems by reducing dirty data amplification and by eliminating page faults, but also enables other use cases, such as live virtual machine migration, unified virtual memory, security and code analysis. All of these use cases open up many promising research directions.

References

[1]
CCIX. https://www.ccixconsortium.com.
[2]
Enzian, a research computer built by the Systems Group at ETH Zürich. http://www.enzian.systems/index.html.
[3]
P.Haul. https://criu.org/P.Haul.
[4]
Pin - a dynamic binary instrumentation tool. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.
[5]
Redis: open-source, in-memory data structure store. https://redis.io.
[6]
Serving DNNs in real time at datacenter scale with Project Brainwave. https://www.microsoft.com/en-us/research/uploads/prod/2018/03/mi0218_Chung-2018Mar25.pdf.
[7]
Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Stanko Novakovic, Arun Ramanathan, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei. Remote regions: a simple abstraction for remote memory. In USENIX Annual Technical Conference (ATC), Boston, MA, 2018.
[8]
Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei. Remote memory in the age of fast networks. In ACM Symposium on Cloud Computing (SoCC), 2017.
[9]
Cristiana Amza, Alan L. Cox, Shandya Dwarkadas, Pete Keleher, Honghui Lu, Ramakrishnan Rajamony, Weimin Yu, and Willy Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, February 1996.
[10]
Joshua Auerbach, David F. Bacon, Perry Cheng, and Rodric Rabbah. Lime: A Java-compatible and synthesizable language for heterogeneous architectures. 2010.
[11]
Luiz Barroso, Mike Marty, David Patterson, and Parthasarathy Ranganathan. Attack of the killer microseconds. Communications of the ACM, March 2017.
[12]
J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Munin: Distributed shared memory based on type-specific memory coherence. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), March 1990.
[13]
Abhishek Bhattacharjee. Translation-triggered prefetching. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[14]
Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, July 1970.
[15]
M. Blott and K. Vissers. Dataflow architectures for 10 Gbps line-rate key-value-stores. In IEEE Hot Chips 25 Symposium (HCS), 2013.
[16]
Greg Bronevetsky, Daniel Marques, Keshav Pingali, Peter Szwed, and Martin Schulz. Application-level checkpointing for shared memory programs. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2004.
[17]
Derek Bruening, Qin Zhao, and Saman Amarasinghe. Transparent dynamic instrumentation. In International Conference on Virtual Execution Environments (VEE), 2012.
[18]
Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan, and Marcos K. Aguilera. Black-box concurrent data structures for NUMA architectures. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[19]
Marco Chiappetta, Erkay Savas, and Cemal Yilmaz. Real time detection of cache-based side-channel attacks using hardware performance counters. Applied Soft Computing, 49(C), December 2016.
[20]
Christopher Clark, Keir Fraser, Steven H, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. Live migration of virtual machines. In Symposium on Networked Systems Design and Implementation (NSDI), 2005.
[21]
Convey Computer. The Convey HC-2 Computer. Architectural Overview. https://www.micron.eom/~/media/documents/products/white-paper/wp_convey_hc2_architectual_overview.pdf, 2012.
[22]
Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In Symposium on Networked Systems Design and Implementation (NSDI), April 2014.
[23]
Aleksandar Dragojević, Dushyanth Narayanan, Ed Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. No compromises: distributed transactions with consistency, availability, and performance. In ACM Symposium on Operating Systems Principles (SOSP), October 2015.
[24]
Jake Edge. DAX, mmap(), and a "go faster" flag. https://lwn.net/Articles/684828/.
[25]
Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network requirements for resource disaggregation. In Symposium on Operating Systems Design and Implementation (OSDI), October 2016.
[26]
G. Gibb, J. W. Lockwood, J. Naous, P. Hartke, and N. McKeown. NetFPGA: An open platform for teaching how to build Gigabit-rate network switches and routers. IEEE Transactions on Education, 2008.
[27]
Heiner Giefers, Raphael Polig, and Christoph Hagleitner. Accelerating arithmetic kernels with coherent attached fpga coprocessors. In Design, Automation & Test in Europe (DATE), 2015.
[28]
Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G Shin. Efficient memory disaggregation with Infiniswap. In Symposium on Networked Systems Design and Implementation (NSDI), 2017.
[29]
Mark Harris. Unified Memory in CUDA 6. https://devblogs.nvidia.com/unified-memory-in-cuda-6/.
[30]
Zecheng He and Ruby B. Lee. How secure is your cache against side-channel attacks? In International Symposium on Microarchitecture (MICRO), 2017.
[31]
Zhenhao He, David Sidler, Zsolt István, and Gustavo Alonso. A flexible k-means operator for hybrid databases. In "International Conference on Field Programmable Logic and Applications (FPL)", 2018.
[32]
John L. Hennessy and David A. Patterson. Computer Architecture, Fifth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., 2011.
[33]
Michael Henson and Stephen Taylor. Memory encryption: A survey of existing techniques. ACM Computing Surveys, March 2014.
[34]
Michael R. Hines, Umesh Deshpande, and Kartik Gopalan. Post-copy live migration of virtual machines. Operating Systems Review, July 2009.
[35]
Intel. EPT-based Sub-Page Permissions. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf.
[36]
Intel. Intel® Architecture Instruction Set Extensions Programming Reference. https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf.
[37]
Intel. Intel® Xeon®+FPGA Platform for the Data Center. http://reconfigurablecomputing4themasses.net/files/2.2%20PK.pdf.
[38]
Intel. Page Modification Logging for Virtual Machine Monitor White Paper. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/page-modification-logging-vmm-white-paper.pdf.
[39]
Daniel Jacobowitz. ptrace() event tracing. https://lwn.net/Articles/10369/.
[40]
Ahmed Khawaja, Joshua Landgraf, Rohith Prakash, Michael Wei, Eric Schkufza, and Christopher J. Rossbach. Sharing, protection, and compatibility for reconfigurable fabric with amorphos. In Symposium on Operating Systems Design and Implementation (OSDI), Carlsbad, CA, 2018.
[41]
Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors. In International Symposium on Computer Architecture (ISCA), 2014.
[42]
Andi Kleen. Machine check handling on Linux. https://www.halobates.de/mce.pdf.
[43]
David Koeplinger, Christina Delimitrou, Raghu Prabhakar, Christos Kozyrakis, Yaqi Zhang, and Kunle Olukotun. Automatic generation of efficient accelerators for reconfigurable hardware. In International Symposium on Computer Architecture (ISCA), 2016.
[44]
Maysam Lavasani, Hari Angepat, and Derek Chiou. An FPGA-based in-line accelerator for Memcached. IEEE Computer Architecture Letters, 2014.
[45]
Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems (TOCS), November 1989.
[46]
Kevin T. Lim, Yoshio Turner, Jose Renato Santos, Alvin AuYoung, Jichuan Chang, Parthasarathy Ranganathan, and Thomas F. Wenisch. System-level implications of disaggregated memory. In IEEE Symposium on High Performance Computer Architecture (HPCA), February 2012.
[47]
Liu Ling, Neal Oliver, Chitlur Bhushan, Wang Qigang, Alvin Chen, Shen Wenbo, Yu Zhihong, Arthur Sheiman, Ian McCallum, Joseph Grecco, Henry Mitchel, Liu Dong, and Prabhat Gupta. High-performance, energy-efficient platforms using in-socket fpga accelerators. In International Symposium on Field Programmable Gate Arrays (FPGA), 2009.
[48]
Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2005.
[49]
Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. TABLA: A unified template-based framework for accelerating statistical machine learning. In IEEE Symposium on High Performance Computer Architecture (HPCA), 2016.
[50]
Yandong Mao, Robert Morris, and Frans Kaashoek. Optimizing MapReduce for multicore architectures. Technical Report MIT-CSAIL-TR-2010-020, May 2010.
[51]
Mellanox. Mellanox Innova™ IPsec 4 Lx Ethernet Adapter Card User Manual. http://www.mellanox.com/related-docs/prod_software/Mellanox_InnovaJPsec_4_Lx_Ethernet_Adapter_Card_User_Manual_rev_1_3.pdf.
[52]
Microsoft. Project Catapult. https://www.microsoft.com/en-us/research/project/project-catapult.
[53]
Microsoft. SDN for the Cloud. https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/keynote.pdf.
[54]
David Mulnix. Intel Xeon Processor Scalable Family Technical Overview. https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview.
[55]
Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun. Processing data where it makes sense: Enabling in-memory computation. Microprocessors and Microsystems, 2019.
[56]
Vijay Nagarajan and Rajiv Gupta. Architectural support for shadow memory in multiprocessors. In International Conference on Virtual Execution Environments (VEE), 2009.
[57]
Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. Latency-tolerant software distributed shared memory. In USENIX Annual Technical Conference (ATC), July 2015.
[58]
Neal Oliver, Rahul R. Sharma, Stephen Chang, Bhushan Chitlur, Elkin Garcia, Joseph Grecco, Aaron Grier, Nelson Ijih, Yaping Liu, Pratik Marolia, Henry Mitchel, Suchit Subhaschandra, Arthur Sheiman, Tim Whisonant, and Prabhat Gupta. A reconfigurable computing system based on a cache-coherent fabric. In International Conference on Reconfigurable Computing and FPGAs (ReConFig), 2011.
[59]
OpenCAPI consortium. http://opencapi.org.
[60]
Muhsen Owaida, David Sidler, Kaan Kara, and Gustavo Alonso. Centaur: A framework for hybrid CPU-FPGA databases. In International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017.
[61]
Mark S. Papamarcos and Janak H. Patel. A low-overhead coherence solution for multiprocessors with private cache memories. In International Symposium on Computer Architecture (ISCA), 1984.
[62]
Mathias Payer, Boris Bluntschli, and Thomas R. Gross. Dynsec: On-the-fly code rewriting and repair. In Hot Topics in Software Upgrades, 2013.
[63]
Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. Linearly compressed pages: A low-complexity, low-latency main memory compression framework. In International Symposium on Microarchitecture (MICRO), 2013.
[64]
Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. A reconfigurable fabric for accelerating large-scale datacenter services. In International Symposium on Computer Architecture (ISCA), 2014.
[65]
Daniel J. Scales, Kourosh Gharachorloo, and Chandramohan A. Thekkath. Shasta: A low overhead, software-only approach for supporting fine-grain shared memory. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1996.
[66]
Ioannis Schoinas, Babak Falsafi, Alvin R. Lebeck, Steven K. Reinhardt, James R. Larus, and David A. Wood. Fine-grain access control for distributed shared memory. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1994.
[67]
Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. The dirty-block index. In International Symposium on Computer Architecture (ISCA), 2014.
[68]
Vivek Seshadri, Gennady Pekhimenko, Olatunji Ruwase, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry, and Trishul Chilimbi. Page overlays: An enhanced virtual memory framework to enable fine-grained memory management. In International Symposium on Computer Architecture (ISCA), 2015.
[69]
Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. LegoOS: A disseminated, distributed OS for hardware resource disaggregation. In Symposium on Operating Systems Design and Implementation (OSDI), Carlsbad, CA, 2018.
[70]
Yizhou Shan, Shin-Yeh Tsai, and Yiying Zhang. Distributed shared persistent memory. In ACM Symposium on Cloud Computing (SoCC), 2017.
[71]
Yongming Shen, Michael Ferdman, and Peter Milder. Maximizing CNN accelerator efficiency through resource partitioning. In International Symposium on Computer Architecture (ISCA), 2017.
[72]
Navin Shenoy. A Milestone in Moving Data. https://newsroom.intel.com/editorials/milestone-moving-data.
[73]
David Sidler, Zsolt István, Muhsen Owaida, Kaan Kara, and Gustavo Alonso. doppioDB: A hardware accelerated database. In International Conference on Management of Data (SIGMOD), 2017.
[74]
Mario Smarduch. Enhanced Live Migration For Intensive Memory Loads. https://events.static.linuxfound.org/sites/events/files/slides/CloudOpen-Japan-2015.pdf.
[75]
Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, and Al Davis. Micro-pages: Increasing DRAM efficiency with locality-aware data placement. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2010.
[76]
Bharat Sukhwani, Thomas Roewer, Charles L. Haymes, Kyu-Hyoun Kim, Adam J. McPadden, Daniel M. Dreps, Dean Sanner, Jan Van Lunteren, and Sameh Asaad. Contutto: A novel FPGA-based prototyping platform enabling innovation in the memory subsystem of a server class processor. In International Symposium on Microarchitecture (MICRO), 2017.
[77]
A. Tran, M. Smith, and J. Miller. A hardware-assisted tool for fast, full code coverage analysis. In International Symposium on Software Reliability Engineering (ISSRE), 2008.
[78]
Irina Chihaia Tuduce and Thomas R. Gross. Adaptive main memory compression. In USENIX Annual Technical Conference (ATC), April 2005.
[79]
Haris Volos, Andres Jaan Tack, and Michael M. Swift. Mnemosyne: Lightweight persistent memory. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011.
[80]
Carl A. Waldspurger. Memory resource management in VMware ESX server. In Symposium on Operating Systems Design and Implementation (OSDI), December 2002.
[81]
Emmett Witchel, Josh Cates, and Krste Asanovic. Mondrian memory protection. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2002.
[82]
Yiying Zhang, Jian Yang, Amirsaman Memaripour, and Steven Swanson. Mojim: A reliable and highly-available non-volatile memory system. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2015.
[83]
Qin Zhao, Derek Bruening, and Saman Amarasinghe. Efficient memory shadowing for 64-bit architectures. In International Symposium on Memory Management (ISMM), 2010.

Cited By

View all
  • (2024)Object-oriented Unified Encrypted Memory Management for Heterogeneous Memory ArchitecturesProceedings of the ACM on Management of Data10.1145/36549582:3(1-29)Online publication date: 30-May-2024
  • (2024)MemSnap μCheckpoints: A Data Single Level Store for Fearless PersistenceProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3620666.3651334(622-638)Online publication date: 27-Apr-2024
  • (2023)Mira: A Program-Behavior-Guided Far Memory SystemProceedings of the 29th Symposium on Operating Systems Principles10.1145/3600006.3613157(692-708)Online publication date: 23-Oct-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
HotOS '19: Proceedings of the Workshop on Hot Topics in Operating Systems
May 2019
227 pages
ISBN:9781450367271
DOI:10.1145/3317550
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 May 2019

Check for updates

Author Tags

  1. FPGA
  2. cache coherence
  3. remote memory

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

HotOS '19
Sponsor:

Upcoming Conference

HOTOS '25
Workshop on Hot Topics in Operating Systems
May 14 - 16, 2025
Banff or Lake Louise , AB , Canada

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)222
  • Downloads (Last 6 weeks)27
Reflects downloads up to 03 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Object-oriented Unified Encrypted Memory Management for Heterogeneous Memory ArchitecturesProceedings of the ACM on Management of Data10.1145/36549582:3(1-29)Online publication date: 30-May-2024
  • (2024)MemSnap μCheckpoints: A Data Single Level Store for Fearless PersistenceProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3620666.3651334(622-638)Online publication date: 27-Apr-2024
  • (2023)Mira: A Program-Behavior-Guided Far Memory SystemProceedings of the 29th Symposium on Operating Systems Principles10.1145/3600006.3613157(692-708)Online publication date: 23-Oct-2023
  • (2023)Vidi: Record Replay for Reconfigurable HardwareProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3582016.3582040(806-820)Online publication date: 25-Mar-2023
  • (2023)CXL over Ethernet: A Novel FPGA-based Memory Disaggregation Design in Data Centers2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)10.1109/FCCM57271.2023.00017(75-82)Online publication date: May-2023
  • (2023)Disaggregated Memory in the Datacenter: A SurveyIEEE Access10.1109/ACCESS.2023.325040711(20688-20712)Online publication date: 2023
  • (2022)SemSwapProceedings of the 13th ACM SIGOPS Asia-Pacific Workshop on Systems10.1145/3546591.3547531(9-17)Online publication date: 23-Aug-2022
  • (2022)A case for using cache line deltas for high frequency VM snapshottingProceedings of the 13th Symposium on Cloud Computing10.1145/3542929.3563481(526-539)Online publication date: 7-Nov-2022
  • (2022)Cache-coherent accelerators for persistent memory crash consistencyProceedings of the 14th ACM Workshop on Hot Topics in Storage and File Systems10.1145/3538643.3539752(37-44)Online publication date: 27-Jun-2022
  • (2022)Enzian: an open, general, CPU/FPGA platform for systems software researchProceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3503222.3507742(434-451)Online publication date: 28-Feb-2022
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media