-
Memory Sharing with CXL: Hardware and Software Design Approaches
Authors:
Sunita Jain,
Nagaradhesh Yeleswarapu,
Hasan Al Maruf,
Rita Gupta
Abstract:
Compute Express Link (CXL) is a rapidly emerging coherent interconnect standard that provides opportunities for memory pooling and sharing. Memory sharing is a well-established software feature that improves memory utilization by avoiding unnecessary data movement. In this paper, we discuss multiple approaches to enable memory sharing with different generations of CXL protocol (i.e., CXL 2.0 and C…
▽ More
Compute Express Link (CXL) is a rapidly emerging coherent interconnect standard that provides opportunities for memory pooling and sharing. Memory sharing is a well-established software feature that improves memory utilization by avoiding unnecessary data movement. In this paper, we discuss multiple approaches to enable memory sharing with different generations of CXL protocol (i.e., CXL 2.0 and CXL 3.0) considering the challenges with each of the architectures from the device hardware and software viewpoint.
△ Less
Submitted 4 April, 2024;
originally announced April 2024.
-
Memory Disaggregation: Advances and Open Challenges
Authors:
Hasan Al Maruf,
Mosharaf Chowdhury
Abstract:
Compute and memory are tightly coupled within each server in traditional datacenters. Large-scale datacenter operators have identified this coupling as a root cause behind fleet-wide resource underutilization and increasing Total Cost of Ownership (TCO). With the advent of ultra-fast networks and cache-coherent interfaces, memory disaggregation has emerged as a potential solution, whereby applicat…
▽ More
Compute and memory are tightly coupled within each server in traditional datacenters. Large-scale datacenter operators have identified this coupling as a root cause behind fleet-wide resource underutilization and increasing Total Cost of Ownership (TCO). With the advent of ultra-fast networks and cache-coherent interfaces, memory disaggregation has emerged as a potential solution, whereby applications can leverage available memory even outside server boundaries.
This paper summarizes the growing research landscape of memory disaggregation from a software perspective and introduces the challenges toward making it practical under current and future hardware trends. We also reflect on our seven-year journey in the SymbioticLab to build a comprehensive disaggregated memory system over ultra-fast networks. We conclude with some open challenges toward building next-generation memory disaggregation systems leveraging emerging cache-coherent interconnects.
△ Less
Submitted 6 May, 2023;
originally announced May 2023.
-
TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory
Authors:
Hasan Al Maruf,
Hao Wang,
Abhishek Dhanotia,
Johannes Weiner,
Niket Agarwal,
Pallab Bhattacharya,
Chris Petersen,
Mosharaf Chowdhury,
Shobhit Kanaujia,
Prakash Chauhan
Abstract:
The increasing demand for memory in hyperscale applications has led to memory becoming a large portion of the overall datacenter spend. The emergence of coherent interfaces like CXL enables main memory expansion and offers an efficient solution to this problem. In such systems, the main memory can constitute different memory technologies with varied characteristics. In this paper, we characterize…
▽ More
The increasing demand for memory in hyperscale applications has led to memory becoming a large portion of the overall datacenter spend. The emergence of coherent interfaces like CXL enables main memory expansion and offers an efficient solution to this problem. In such systems, the main memory can constitute different memory technologies with varied characteristics. In this paper, we characterize memory usage patterns of a wide range of datacenter applications across the server fleet of Meta. We, therefore, demonstrate the opportunities to offload colder pages to slower memory tiers for these applications. Without efficient memory management, however, such systems can significantly degrade performance.
We propose a novel OS-level application-transparent page placement mechanism (TPP) for CXL-enabled memory. TPP employs a lightweight mechanism to identify and place hot/cold pages to appropriate memory tiers. It enables a proactive page demotion from local memory to CXL-Memory. This technique ensures a memory headroom for new page allocations that are often related to request processing and tend to be short-lived and hot. At the same time, TPP can promptly promote performance-critical hot pages trapped in the slow CXL-Memory to the fast local memory, while minimizing both sampling overhead and unnecessary migrations. TPP works transparently without any application-specific knowledge and can be deployed globally as a kernel release.
We evaluate TPP in the production server fleet with early samples of new x86 CPUs with CXL 1.1 support. TPP makes a tiered memory system performant as an ideal baseline (<1% gap) that has all the memory in the local tier. It is 18% better than today's Linux, and 5-17% better than existing solutions including NUMA Balancing and AutoTiering. Most of the TPP patches have been merged in the Linux v5.18 release.
△ Less
Submitted 28 May, 2023; v1 submitted 6 June, 2022;
originally announced June 2022.
-
Memtrade: A Disaggregated-Memory Marketplace for Public Clouds
Authors:
Hasan Al Maruf,
Yuhong Zhong,
Hongyi Wang,
Mosharaf Chowdhury,
Asaf Cidon,
Carl Waldspurger
Abstract:
We present Memtrade, the first memory disaggregation system for public clouds. Public clouds introduce a set of unique challenges for resource disaggregation across different tenants, including security, isolation and pricing. Memtrade allows producer virtual machines (VMs) to lease both their unallocated memory and allocated-but-idle application memory to remote consumer VMs for a limited period…
▽ More
We present Memtrade, the first memory disaggregation system for public clouds. Public clouds introduce a set of unique challenges for resource disaggregation across different tenants, including security, isolation and pricing. Memtrade allows producer virtual machines (VMs) to lease both their unallocated memory and allocated-but-idle application memory to remote consumer VMs for a limited period of time. Memtrade does not require any modifications to host-level system software or support from the cloud provider. It harvests producer memory using an application-aware control loop to form a distributed transient remote memory pool with minimal performance impact; it employs a broker to match producers with consumers while satisfying performance constraints; and it exposes the matched memory to consumers as a secure KV cache. Our evaluation using real-world cluster traces shows that Memtrade provides significant performance benefit for consumers (improving average read latency up to 2.8x) while preserving confidentiality and integrity, with little impact on producer applications (degrading performance by less than 2.1%).
△ Less
Submitted 16 August, 2021;
originally announced August 2021.
-
Effectively Prefetching Remote Memory with Leap
Authors:
Hasan Al Maruf,
Mosharaf Chowdhury
Abstract:
Memory disaggregation over RDMA can improve the performance of memory-constrained applications by replacing disk swapping with remote memory accesses. However, state-of-the-art memory disaggregation solutions still use data path components designed for slow disks. As a result, applications experience remote memory access latency significantly higher than that of the underlying low-latency network,…
▽ More
Memory disaggregation over RDMA can improve the performance of memory-constrained applications by replacing disk swapping with remote memory accesses. However, state-of-the-art memory disaggregation solutions still use data path components designed for slow disks. As a result, applications experience remote memory access latency significantly higher than that of the underlying low-latency network, which itself is too high for many applications.
In this paper, we propose Leap, a prefetching solution for remote memory accesses due to memory disaggregation. At its core, Leap employs an online, majority-based prefetching algorithm, which increases the page cache hit rate. We complement it with a lightweight and efficient data path in the kernel that isolates each application's data path to the disaggregated memory and mitigates latency bottlenecks arising from legacy throughput-optimizing operations. Integration of Leap in the Linux kernel improves the median and tail remote page access latencies of memory-bound applications by up to 104.04x and 22.62x, respectively, over the default data path. This leads to up to 10.16x performance improvements for applications using disaggregated memory in comparison to the state-of-the-art solutions.
△ Less
Submitted 21 November, 2019;
originally announced November 2019.
-
Hydra: Resilient and Highly Available Remote Memory
Authors:
Youngmoon Lee,
Hasan Al Maruf,
Mosharaf Chowdhury,
Asaf Cidon,
Kang G. Shin
Abstract:
We present Hydra, a low-latency, low-overhead, and highly available resilience mechanism for remote memory. Hydra can access erasure-coded remote memory within a single-digit microsecond read/write latency, significantly improving the performance-efficiency trade-off over the state-of-the-art -- it performs similar to in-memory replication with 1.6X lower memory overhead. We also propose CodingSet…
▽ More
We present Hydra, a low-latency, low-overhead, and highly available resilience mechanism for remote memory. Hydra can access erasure-coded remote memory within a single-digit microsecond read/write latency, significantly improving the performance-efficiency trade-off over the state-of-the-art -- it performs similar to in-memory replication with 1.6X lower memory overhead. We also propose CodingSets, a novel coding group placement algorithm for erasure-coded data, that provides load balancing while reducing the probability of data loss under correlated failures by an order of magnitude. With Hydra, even when only 50% of memory is local, unmodified memory-intensive applications achieve performance close to that of the fully in-memory case in the presence of remote failures and outperform the state-of-the-art solutions by up to 4.35X.
△ Less
Submitted 28 May, 2023; v1 submitted 21 October, 2019;
originally announced October 2019.