We add a thin layer in the SPDK framework [
50] to implement eZNS and realize the
v-zone concept. The primary reason for choosing the SPDK approach was its ease of implementation and integration into the software stack of a storage server accessible by remote clients. Moreover, the SPDK-based design can also be used in a local system to serve virtual machines through the SPDK vhost extension. This approach allows the storage server to provide efficient and high-performance I/O operations while remaining compatible with existing software stacks. We use the same test environment as in Section
3.1. Non-SPDK applications require a standard ZNS block device exposed via the kernel NVMe driver; thus, we set up eZNS as a disaggregated storage device over RDMA (NVMe-over-RDMA) and connect to it using the kernel NVMe driver.
Default v-zone Configuration. By default, eZNS creates four namespaces (NS1–4), each of which is allocated 32 essential and 32 spare resources. Since each namespace provides a maximum of 16 active zones, the minimum stripe width for
v-zone is 2 with a stripe size of 32 KB. However, eZNS can overdrive the width up to 16 with a stripe size of 4 KB. For a fair comparison, we prepare a static logical zone configured with a stripe width of 4 and a stripe size of 16 KB; hence, it also exploits the full device capability when the application populates enough active logical zones. Both a
v-zone and a static logical zone comprise 16 physical zones. Different configurations are used for single-tenant evaluation (single namespace) and the YCSB benchmark (six namespaces), as specified in Section
5.3.
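To make the arithmetic behind these defaults concrete, the following sketch (hypothetical helper code, not taken from the eZNS/SPDK implementation) derives the minimum width from the per-namespace zone budget and the stripe size from the width; the fixed 64 KB full-stripe size is inferred from the configurations reported in this section.
```c
#include <stdio.h>

/* Hypothetical illustration of the default v-zone arithmetic described above.
 * The names and the 64 KB full-stripe constant are inferred from the reported
 * configurations (width 2 / 32 KB up to width 16 / 4 KB), not taken from the
 * eZNS source. */
#define ESSENTIAL_ZONES_PER_NS   32
#define MAX_ACTIVE_LOGICAL_ZONES 16
#define FULL_STRIPE_BYTES        (64 * 1024)

static unsigned min_stripe_width(void)
{
    /* 32 essential physical zones shared by up to 16 active logical zones. */
    return ESSENTIAL_ZONES_PER_NS / MAX_ACTIVE_LOGICAL_ZONES;   /* = 2 */
}

static unsigned stripe_size_bytes(unsigned width)
{
    /* Stripe size shrinks as the width grows, keeping the full stripe fixed. */
    return FULL_STRIPE_BYTES / width;   /* width 2 -> 32 KB, width 16 -> 4 KB */
}

int main(void)
{
    for (unsigned w = min_stripe_width(); w <= 16; w *= 2)
        printf("width %2u -> stripe %2u KB\n", w, stripe_size_bytes(w) / 1024);
    return 0;
}
```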
5.1 Zone Ballooning
We demonstrate the efficiency of zone ballooning when handling large writes (i.e., 512 KB I/O with a queue depth of 1). First, within a namespace, we compare the performance between a
v-zone and a static logical zone, where the number of writers is set to 4, 8, or 16. Each writer submits write I/Os to a different zone. Our
local overdrive operation can reap more spare zones and lead to better throughput. As shown in Figure
26, the
v-zone outperforms the static one by 2.0× in the 4-writer case, as 4 static logical zones enable only 16 physical zones while 4 v-zones overdrive the width to 8 and expand to 32 physical zones. In the 8-writer and 16-writer cases, the v-zone reduces the overdrive width accordingly and utilizes the same number of physical zones (32 and 64, respectively) as the static logical zone.
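A quick back-of-the-envelope check of these physical-zone counts follows; the widths are taken from the measurements above, and this is not how eZNS itself derives them.
```c
#include <stdio.h>

/* Sanity check of the physical-zone counts reported for the large-write
 * experiment: zones used = writers x stripe width. */
struct casecfg { int writers; int vzone_width; int static_width; };

int main(void)
{
    struct casecfg cases[] = { {4, 8, 4}, {8, 4, 4}, {16, 4, 4} };
    for (int i = 0; i < 3; i++) {
        int v = cases[i].writers * cases[i].vzone_width;    /* v-zone      */
        int s = cases[i].writers * cases[i].static_width;   /* static zone */
        printf("%2d writers: v-zone uses %2d physical zones, static uses %2d\n",
               cases[i].writers, v, s);
    }
    return 0;
}
```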
To evaluate eZNS’s adaptiveness under dynamic workloads, we set up overdriven zones from different namespaces. The first three namespaces (NS1, NS2, and NS3) run two writers, whereas the fourth namespace (NS4) runs eight. NS1, NS2, and NS3 stop issuing writes at
t = 30 seconds and resume the writing activity at
t = 80 seconds. We measure the throughput and spare zone usage of four zones for a 100-second profiling window (Figures
27 and
28). When the other three zones become idle, the
v-zone from NS4 takes up to 3× more spare zones from other namespaces using the
global overdrive primitive and maxes out its write bandwidth (
\(\sim\)2.3 GB/s). It then quickly releases the harvested zones when the other zones start issuing writes again.
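The lend-and-reclaim behavior behind this experiment can be pictured with the simplified sketch below; the names and rebalancing policy are illustrative assumptions, not the actual eZNS code. Idle namespaces lend their spare zones to the busy one, and the borrowed zones are returned once a lender resumes writing.
```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified sketch of the global overdrive behavior observed above. */
#define NUM_NS 4

struct ns_state {
    int  spare_zones;   /* spares currently owned by this namespace */
    bool writing;       /* has the namespace issued writes recently? */
};

static void rebalance(struct ns_state ns[NUM_NS], int busy)
{
    for (int i = 0; i < NUM_NS; i++) {
        if (i == busy || ns[i].writing || ns[i].spare_zones == 0)
            continue;
        /* Lend the idle namespace's spares to the busy one. */
        ns[busy].spare_zones += ns[i].spare_zones;
        ns[i].spare_zones = 0;
    }
}

static void reclaim(struct ns_state ns[NUM_NS], int busy, int owner, int amount)
{
    /* A lender reclaims its spares once it resumes writing. */
    ns[busy].spare_zones -= amount;
    ns[owner].spare_zones += amount;
}

int main(void)
{
    struct ns_state ns[NUM_NS] = {
        {32, false}, {32, false}, {32, false}, {32, true} /* NS4 is busy */
    };
    rebalance(ns, 3);
    printf("NS4 spares after global overdrive: %d (up to ~3x more)\n",
           ns[3].spare_zones);
    ns[0].writing = true;          /* NS1 resumes at t = 80 s */
    reclaim(ns, 3, 0, 32);
    printf("NS4 spares after reclaim: %d\n", ns[3].spare_zones);
    return 0;
}
```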
5.2 Zone I/O Fairness
We evaluate our I/O scheduler in various synthetic congestion scenarios by placing competing zones in the same physical die group. We compare the performance of all co-located zones when enabling and disabling our mechanism. The zone ballooning mechanism is turned off for all cases. We report per-thread bandwidth in Figure
29.
Read-Read Fairness. We run a sequential read with a 128 KB I/O size on two types of zones placed on co-located dies. To load the physical dies equally, we populate more threads for lower-width zones. For example, a zone with a width of 2 runs four threads on each stripe group, whereas a zone with a width of 8 has only one thread. As shown in Figure
29(a), in scenario 1, when our congestion control mechanism is disabled, Zone A (stripe width 2, stripe size 32 KB, QD-1) and Zone B (stripe width 8, stripe size 8 KB, QD-32) achieve 76 MB/s and 1,287 MB/s, respectively, even though they hold the same-sized full stripe. This is because the zone with the higher QD dominates the contended die. Our scheme effectively controls the per-zone window size and ensures that each zone submits the same amount of outstanding bytes; hence, both Zone A and Zone B sustain 290 MB/s. In scenarios 2 and 3, we change the Zone A stripe configuration to <stripe width 4, stripe size 16 KB, QD-1> and <stripe width 8, stripe size 8 KB, QD-1>, respectively, and observe similar behavior when the read congestion logic is turned off. In scenario 3, the congestion level on the die is lower because Zone A submits only one concurrent 128 KB I/O per die (versus 4 and 2 in scenarios 1 and 2, respectively). Hence, the read latency falls below the threshold, and the I/O scheduler chooses to max out the bandwidth.
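The per-zone window logic can be illustrated with the sketch below, which is only a delay-based approximation of the scheduler described in Section 4.5; the threshold, window bounds, and update rule are assumptions. Each zone may keep at most window_bytes outstanding, the window shrinks when the measured read latency exceeds a threshold, and it grows toward a cap otherwise.
```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative per-zone read congestion window (not the eZNS code). */
struct zone_cc {
    uint32_t window_bytes;      /* outstanding-byte budget for this zone   */
    uint32_t inflight_bytes;    /* bytes currently submitted to the device */
};

#define LAT_THRESHOLD_US  600           /* hypothetical congestion threshold */
#define WINDOW_MIN        (32u * 1024)
#define WINDOW_MAX        (1024u * 1024)

static void on_read_completion(struct zone_cc *z, uint32_t io_bytes,
                               uint32_t latency_us)
{
    z->inflight_bytes -= io_bytes;
    if (latency_us > LAT_THRESHOLD_US) {
        /* Congested die: halve the budget, but keep a minimum window. */
        z->window_bytes /= 2;
        if (z->window_bytes < WINDOW_MIN)
            z->window_bytes = WINDOW_MIN;
    } else {
        /* Uncongested: grow toward the cap to max out bandwidth. */
        z->window_bytes += io_bytes;
        if (z->window_bytes > WINDOW_MAX)
            z->window_bytes = WINDOW_MAX;
    }
}

static int can_submit(const struct zone_cc *z, uint32_t io_bytes)
{
    /* Every zone is held to the same outstanding-byte budget, so zones
     * with different QDs end up with the same in-flight bytes. */
    return z->inflight_bytes + io_bytes <= z->window_bytes;
}

int main(void)
{
    struct zone_cc zone = { .window_bytes = 256 * 1024, .inflight_bytes = 0 };
    uint32_t io = 128 * 1024;
    if (can_submit(&zone, io))
        zone.inflight_bytes += io;
    on_read_completion(&zone, io, 900 /* us, above threshold */);
    printf("window after congestion signal: %u KB\n",
           (unsigned)(zone.window_bytes / 1024));
    return 0;
}
```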
Write-Write Fairness. We carefully create different write congestion scenarios and examine how our admission control operates. The workload is a sequential write with a 512 KB I/O size. In the first scenario, we co-locate 16 regular write zones (Zone A, where each has a striping width of 8 with an 8 KB stripe size and submits write I/Os at 5 ms intervals, sustaining a maximum throughput of 95 MB/s) with a busy writer (Zone B, which has a width of 2 and a 32 KB stripe size and submits I/Os without delay, achieving at most 85 MB/s). Figure
29(b) reports the bandwidth utilization of one regular zone (Zone A) and the busy writer (Zone B). Our admission control mechanism limits the write issuing rate of Zone B and gives the regular zone (Zone A) more room in the write cache, leading to a 35.7% per-thread bandwidth improvement. Next, we set up a highly congested case by turning the 16 regular zones into busy writers (scenario 2). Without admission control, Zone B runs at 64.9 MB/s, which is 32.5 MB/s per physical zone or 76.3% of the physical zone bandwidth, whereas Zone A receives only 16.4 MB/s per physical zone or 38.4% of the physical zone bandwidth. As described in Section 4.5.2, our scheme equally distributes the write bandwidth share across competing zones: Zone B receives 56.8% of the total bandwidth of 2 physical zones, and the bandwidth of Zone A increases by 7.6%. As a result, the overall device bandwidth improves from 2,160.9 MB/s to 2,304.3 MB/s, or by 6.6%. The last scenario is collision-free at the die level: we eliminate the overlapping region among all write zones by populating fewer active physical zones than the number of dies (i.e., reducing the number of regular zones to 15). Similarly, when the admission control is enabled, the bandwidth allocated to Zone B decreases slightly (\(\sim\)7.2%) to avoid cache congestion, and the overall device bandwidth increases by 24.7% (from 2,403.3 MB/s to 2,997.7 MB/s).
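A token-bucket-style approximation of this admission control is sketched below; the refill interval, bandwidth estimate, and function names are assumptions, not the eZNS implementation. The write bandwidth of a die group is divided equally among its actively writing zones, and each zone may submit writes only up to its allotted byte budget.
```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative per-zone write admission control (not the eZNS code). */
struct zone_ac {
    uint64_t tokens_bytes;       /* bytes this zone may still submit */
};

static void refill(struct zone_ac *zones, int nzones,
                   uint64_t group_bw_bytes_per_s, uint64_t interval_us)
{
    /* Equal share of the die group's write bandwidth for this interval. */
    uint64_t share = group_bw_bytes_per_s * interval_us / 1000000 / nzones;
    for (int i = 0; i < nzones; i++)
        zones[i].tokens_bytes += share;
}

static int admit_write(struct zone_ac *z, uint64_t io_bytes)
{
    if (z->tokens_bytes < io_bytes)
        return 0;                /* hold the I/O until the next refill */
    z->tokens_bytes -= io_bytes;
    return 1;
}

int main(void)
{
    struct zone_ac zones[17] = { 0 };   /* e.g., 16 regular zones + 1 busy writer */
    refill(zones, 17, 2300ull * 1024 * 1024, 100000 /* 100 ms */);
    printf("busy writer admitted 512 KB: %d\n",
           admit_write(&zones[16], 512 * 1024));
    return 0;
}
```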
Read-Write Fairness. We examine how our congestion control mechanism coordinates with the admission control when handling read/write mixed workloads. In this experiment, we set up three types of zones: (1) 16 regular readers (Zone A), each with a striping width of 2 and a 32 KB stripe size, performing 128 KB random reads at queue depth 32 across all physical dies; (2) one busy writer (Zone B), whose striping width is 2 with a 32 KB stripe size; and (3) 16 regular writers (Zone C), each with a striping width of 8 and a 32 KB stripe size, submitting I/Os at 5 ms intervals. Both B and C issue 512 KB large writes. Figure
29(c) reports their per-thread bandwidth. When our scheduler is disabled, each reader achieves 199.6 MB/s, but writes suffer significantly: Zone B and Zone C achieve only 19.3% and 27.3% of their maximum bandwidth, respectively. As we gradually turn on our mechanisms, the congestion control shrinks the window size so that more bandwidth is allocated to the writes, and the admission control then equally partitions the bandwidth among the competing write zones. As shown in the CC+AC case, Zones A, B, and C sustain 71.6%, 57.5%, and 70.1% of their maximum bandwidth capacity, respectively.
5.3 Application: RocksDB
To evaluate eZNS in a real-world scenario, we use RocksDB [
41] over the ZenFS storage backend. In addition to the workload profiles built into the RocksDB
db_bench tool, we port YCSB workload generators [
4] for the mixed workload evaluation.
Single-Tenant Performance. First, we run the
overwrite profile of
db_bench to evaluate the write performance of eZNS. Figure
30 demonstrates that eZNS improves the throughput by 46.1% and 84.5% with
local and
global overdrive, respectively. ZenFS opens all available zones regardless of actual usage; hence, our local overdrive has minimal impact, and the resulting stripe width of 4 is the same as that of the static zones. However, our I/O scheduler mitigates intra-namespace interference, and each zone receives a fair share of the bandwidth, eliminating unnecessary application delays caused by zone interference. When global overdrive kicks in, zones harness more active resources and attain higher bandwidth.
Next, we evaluate the performance of a single tenant using the
readwhilewriting profile of
db_bench, which runs one writer and multiple readers, representing a read/write mixed scenario. For the single-tenant configuration, eZNS creates a single namespace on the device and allocates 128 essential and 128 spare resources to it. Since only two stripe widths, 8 and 16, are possible in this configuration, eZNS sets the stripe size to 16 KB for the width of 8 to avoid the namespace running only on large stripe sizes. We compare eZNS against two static configurations, both with a stripe width of 16 but with different stripe sizes of 4 KB and 16 KB. Since there is only one namespace on the device, eZNS always overdrives
v-zones to the width of 16, which is identical to the static configurations. Therefore, both the static namespace and eZNS can exploit all available bandwidth on the device. However, the I/O scheduler of eZNS helps mitigate interferences between zones and improves overall application performance. Figure
31 shows that eZNS improves the p99.9 and p99.99 read latency by 28.7% and 11.3% over the static configurations with stripe sizes of 16 KB and 4 KB, respectively. Additionally, eZNS improves the throughput by 11.5% and 2.5% over the static configurations with stripe sizes of 4 KB and 16 KB, respectively.
Multi-Tenant Performance. Next, we set up instances of
db_bench on four namespaces (A, B, C, and D), each with a different workload profile. A and B perform the
overwrite profile, whereas C and D execute
randomread concurrently. We run the benchmark for 1,800 seconds and report the latency and the throughput. Figure
32 shows that our I/O scheduler significantly reduces p99.9 and p99.99 read (C/D) latency by 71.1% and 20.5%, respectively. In terms of throughput, eZNS improves write (A/B) and read (C/D) throughput by 7.5% and 17.7%, respectively. Furthermore, while the read latency and throughput are improved, the write latency is either maintained at the same level or decreased compared to the static configuration because eZNS moves the spare bandwidth from read-only namespaces (C/D) to write-heavy ones (A/B) (Figure
33).
Mixed YCSB Workloads with Four Namespaces. YCSB [
15] is widely used to benchmark realistic workloads. In our experiments, we run YCSB workload profiles A, B, C, and F on each of the four namespaces. We exclude YCSB workload profiles D and E because they increase the number of entities in the DB instance during the benchmark. As YCSB-C (read-only) does not submit any write I/Os during the benchmark, eZNS triggers global overdrive and rebalances the bandwidth to the most write-intensive namespaces (A and F). Figure 34 shows that the I/O scheduler improves the p99.9 read latency of the read-intensive workloads (YCSB B and C) as well as the read-modify-write one (YCSB F) by 79.1%, 80.3%, and 76.8%, respectively. The throughput improvement from global overdrive is up to 10.9% for the most write-intensive workload (YCSB A), as shown in Figure
35.
Mixed YCSB Workloads with Six Namespaces. We also conducted evaluations using all six workload profiles of YCSB (A–F) with a configuration involving six namespaces. To support six namespaces, we reduced the maximum active zones to 11, allocating 22 essential and 20 spare zones to each namespace during initialization. The remaining four physical zones are designated as part of the global spare pool.
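The zone budget behind this configuration can be verified with simple arithmetic; the 256-zone total below is inferred from the default four-namespace setup (32 essential plus 32 spare zones each) rather than stated explicitly.
```c
#include <stdio.h>

/* Sanity check of the six-namespace zone budget described above. */
int main(void)
{
    int total       = 4 * (32 + 32);   /* 256 zones in the default setup      */
    int per_ns      = 22 + 20;         /* essential + spare zones per namespace */
    int namespaces  = 6;
    int global_pool = total - namespaces * per_ns;
    printf("global spare pool: %d physical zones\n", global_pool);   /* 4 */
    return 0;
}
```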
Figures
36 and
37 present the tail latencies and throughput comparisons between eZNS and the static zone configuration. For YCSB A through C and F, our results closely resemble those observed in the four-namespace scenario. Notably, YCSB A demonstrates the most significant throughput improvement, with an increase of 8.6%, whereas the read-heavy workloads (YCSB B, C, and F) exhibit remarkable reductions in p99.9 read latency, with improvements of up to 77.6%. YCSB D, a read-intensive profile focusing on the latest data, also shows notable improvements, with p99.9 latency reduced by 76.9% and throughput increased by 4.8%. In contrast, YCSB E, which represents a range-scan workload, demonstrates the least improvement among the six workload profiles. Although its p99.9 read latency is reduced by 44.3%, its throughput remains slightly below that of the static zone, at 99.0%, and its p99.99 read latency is worse than with the static zone configuration. This is primarily due to YCSB E's higher per-operation cost and the increased bandwidth of the other tenants. Each scan operation generates a large number of read I/Os, so a single operation is exposed to more congested I/Os than in the other workload profiles. At the same time, the increased device bandwidth further raises the chance of congestion when accessing dies. As a result, the worst-case latency can be higher than with the static configuration. If we capped the tenants' throughput at the level of the static configuration, the p99.99 latency would decrease dramatically as well.
eZNS on a File System (F2FS). To evaluate the performance of eZNS on a general file system, we replicated the scenario with four namespaces using RocksDB over F2FS [
30] instead of ZenFS, while maintaining an identical zone configuration to that of ZenFS. The read-intensive workloads (YCSB B, C, and F) demonstrate improvements in both p99.9 read latencies and throughput, as illustrated in Figures
38 and
39.
However, YCSB A does not benefit from eZNS and even performs worse than the static zone configuration, achieving only 95.3% of its throughput. This can be attributed to the low zone utilization of F2FS: we observed that F2FS opens up to three zones but allocates only one zone for writing user data, resulting in low write bandwidth for both eZNS and the static zone. Additionally, since we tuned the maximum active zones to 16 and the stripe size cannot be smaller than 4 KB, eZNS cannot increase the striping width beyond 8. Consequently, the global overdrive mechanism does not operate effectively in this scenario, forcing eZNS to trade a small amount of YCSB A throughput for a fair distribution of read bandwidth across all namespaces.
5.4 Overhead Analysis
End-to-End Read Latency Overhead. Since eZNS serves as an orchestration layer between the physical ZNS device and the NVMe-over-Fabrics target, there may be some overhead when the I/O load is very low. To measure this overhead, we conducted a quantitative analysis using 4 KB random read I/Os and compared it with host-managed zone access, where the host directly accesses the physical device without eZNS. Figure
40 demonstrates that eZNS does not add a noticeable latency overhead for I/O depths up to 8. As the I/O depth goes over 16, up to 14.0% overhead is observed due to the I/O scheduler delaying the I/O submission. However, the scheduler provides significant advantages in real-world scenarios, as shown in previous experiments.
Memory Footprint. eZNS relies on in-memory data structures for managing v-zone metadata, including the logical-to-physical mapping and scheduling statistics. Additionally, it maintains a copy of the physical zone information to reduce unnecessary queries to the device, enabling faster zone allocation and deallocation. In our current implementation, the metadata of each v-zone is less than 1 KB, and each physical zone information entry is smaller than 64 bytes. For our testbed SSD with four namespaces, each with 1 TB of capacity, the v-zone metadata and the physical zone information require 2 MB and 2.5 MB of memory, respectively. Compared to the memory requirements of the page mapping in conventional SSDs, the memory usage of eZNS is negligible.
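For illustration, the following hypothetical structures show one way such metadata could be laid out within the reported bounds; the field names are ours, and the actual eZNS structures differ.
```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of the per-zone metadata discussed above, sized to
 * respect the reported bounds (< 1 KB per v-zone, < 64 B per physical zone). */

struct phys_zone_info {               /* cached copy of device-side zone state */
    uint64_t start_lba;
    uint64_t write_pointer;
    uint64_t capacity;
    uint32_t state;                   /* empty / open / full / offline */
    uint32_t owner_ns;                /* namespace currently holding the zone */
};
_Static_assert(sizeof(struct phys_zone_info) <= 64,
               "physical zone info must stay under 64 bytes");

struct vzone_meta {
    uint32_t phys_zone_ids[64];       /* logical-to-physical mapping */
    uint16_t stripe_width;
    uint16_t stripe_size_kb;
    uint64_t write_pointer;
    /* per-zone scheduling statistics */
    uint32_t read_window_bytes;
    uint64_t write_tokens_bytes;
    uint32_t latency_ewma_us;
};
_Static_assert(sizeof(struct vzone_meta) <= 1024,
               "v-zone metadata must stay under 1 KB");

int main(void)
{
    printf("v-zone meta: %zu B, phys zone info: %zu B\n",
           sizeof(struct vzone_meta), sizeof(struct phys_zone_info));
    return 0;
}
```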