4.2 Experimental Setup
Our evaluation testbed consists of two dual-socket Intel Xeon Gold 5218 servers running at 2.3GHz with 32 cores and 64GB of DDR4 DRAM. All nodes run Ubuntu 20.04 with Linux kernel 5.4. Each node is equipped with an NVIDIA BlueField-2 DPU, which has eight Armv8 A72 cores with a 6MB shared L3 cache, 16GB of DRAM, and a 100Gbps RDMA NIC. For the storage node, we equip four 480GB Samsung 983 ZET SSDs and configure its DPU to offload the NVMe-oF targets to the RDMA NIC. The two nodes are connected through a 100GbE switch, and we use RoCE for RDMA.
System configurations. Our natural Baseline is vanilla RocksDB (v6.4.9), which runs all compactions on the compute node's CPU. Unless otherwise stated, all tests with RocksDB use the following configuration: the compression algorithm is set to zlib, the Bloom filter is enabled, and direct I/O is enabled to eliminate the impact of the OS page cache. We vary several parameters in our experiments, including the number of compaction threads and the key-value size; the remaining parameters are set to RocksDB defaults.
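For concreteness, the Baseline settings above map onto RocksDB's options API roughly as in the C++ sketch below; this is a minimal sketch, and option values not stated in the text (e.g., the Bloom filter bits per key and the compaction thread count shown) are assumptions.
\begin{verbatim}
#include <rocksdb/db.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Sketch of the Baseline configuration described above; values not stated
// in the text (Bloom bits per key, thread count) are assumptions.
rocksdb::Options MakeBaselineOptions() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.compression = rocksdb::kZlibCompression;         // zlib compression
  opts.use_direct_reads = true;                         // bypass OS page cache
  opts.use_direct_io_for_flush_and_compaction = true;

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.filter_policy.reset(
      rocksdb::NewBloomFilterPolicy(10 /* assumed bits per key */));
  opts.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_opts));

  // The number of compaction threads is varied per experiment.
  opts.max_background_compactions = 8;                  // example value
  return opts;
}
\end{verbatim}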
Comparison groups. We also set up several comparison groups to verify our optimization techniques.
Baseline+HC offloads the (de)compression steps of compaction to the DPU's accelerator, while the remaining steps still run on the host CPU. Naive-Off uses an NFS service to share data files between the CPU and the DPU, and it naively offloads all compaction jobs to the DPU without hardware-assisted compression (HC). DComp uses the DPU-aware file system to support offloading and turns on HC, but it only offloads compaction to the compute node's DPU. D\({^2}\)Comp further supports offloading compactions to the storage node's DPU and implements a resource-aware dispatching (RD) policy across the host CPU, storage-side DPU, and compute-side DPU, as described in Section 3.2.
Workloads and Datasets. We run tests using two popular KV workloads, YCSB and db_bench. For most db_bench tests, we follow common practice and load a 100GB database with randomly generated keys, using a 16-byte key size and a 1KB value size, as the initial state. The run phase then issues 20M requests.
4.3 Evaluating the Single Compaction Task
To evaluate the impact of D\({^2}\)Comp on single-compaction performance, we prepare 10M KV records with varying key and value sizes and manually trigger a compaction to merge them. In this test, DComp offloads the entire compaction job to the compute node's DPU, while D\({^2}\)Comp offloads it to the storage node's DPU; both turn on HC. Figure 11 compares the efficiency of the different comparison groups in executing a single compaction job.
Compaction throughput. Figure 11(a) shows the compaction throughput, measured as merged input KV records per second. We observe that with hardware-assisted compression (HC) turned on, compaction performance is significantly higher across all KV sizes. Taking the commonly used 16-byte key and 1KB value size as an example, Baseline+HC improves performance by \(3.81\times\) over Baseline. This gain mainly comes from the hardware acceleration of the (de)compression steps of compaction. Figure 11(b) shows the execution time breakdown: the time spent on (de)compression decreases significantly after offloading it to the deflate accelerator.
Naive-Off performs worse than Baseline because, after offloading to the DPU, the execution time of all computational steps increases considerably due to the slower Arm cores. DComp and D\({^2}\)Comp outperform Naive-Off because the DPU's accelerator speeds up the (de)compression steps. In addition, compared to Naive-Off, both DComp and D\({^2}\)Comp exhibit shorter I/O times, primarily because the DPU-aware file system reduces redundant data movement. However, this gain is not as pronounced as that from the accelerator, since compaction performance is mainly constrained by computation. From this experiment, we conclude that hardware-assisted compression is critical to improving compaction performance.
Impact of KV size. From Figure 11(a), we also observe that larger KV sizes benefit more from HC than smaller ones. For example, with a 16-byte key and 128-byte and 32-byte values, Baseline+HC achieves only \(3.34\times\) and \(2.86\times\) throughput improvements over Baseline, respectively. This is mainly because an SSTable contains more KV pairs when the KV size is smaller, so more time is spent on data merging and reordering. In contrast, the KV size has little effect on the execution time of the (de)compression steps because they operate at a fixed-size block granularity.
Resource consumption. Figure 11(c) further shows the host CPU utilization and network traffic of the different comparison groups. As indicated by Baseline, executing a single compaction task on the host CPU consumes an entire core and generates about 7.6GB of network traffic. Baseline+HC has similar data traffic but lower host CPU usage than Baseline because the (de)compression operations are offloaded to the accelerator. Naive-Off eliminates the host CPU footprint by offloading the compaction to the DPU, but it doubles the network traffic since the data must first be read by the CPU before being passed to the DPU. This is also confirmed by Figure 11(b), which shows that the I/O time of Naive-Off is significantly higher than that of the other groups. DComp eliminates this redundant data traffic by enabling the DPU to read and write data directly from the SSD. However, it still needs to fetch data from the storage node to the compute-side DPU, so network data movement remains unavoidable. D\({^2}\)Comp further eliminates this data movement by offloading the compaction to the storage node's DPU.
4.4 Evaluating the Overall Performance
In this section, we use the micro-benchmark db_bench to evaluate the impact of different comparison groups on overall read and write throughput, latency, and resource consumption. In this test, Naive-Off+HC denotes offloading all compaction tasks to the DPU with HC turned on, while DComp and D\(^2\)Comp leave \(L_0\)-\(L_1\) compaction on the host CPU and offload \(L_2\)-\(L_n\) compaction to the DPUs of the compute node and storage node, respectively.
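This level-based split can be summarized by the following C++ sketch; the type and function names (CompactionJob, Target, Dispatch) are hypothetical and only illustrate the dispatch rule stated above and the overload fallback discussed later in this section, not the actual D\({^2}\)Comp implementation.
\begin{verbatim}
// Hypothetical sketch of the level-based dispatch rule used by DComp and
// D2Comp; names are illustrative, not the actual implementation.
enum class Target { HostCpu, ComputeDpu, StorageDpu };

struct CompactionJob {
  int output_level;  // level the compaction writes into (1 for L0-L1 jobs)
};

Target Dispatch(const CompactionJob& job, bool storage_dpu_available,
                bool dpu_overloaded) {
  // Performance-critical L0-L1 compactions stay on the fast host CPU cores.
  if (job.output_level <= 1) {
    return Target::HostCpu;
  }
  // L2-Ln compactions prefer the storage-side DPU (D2Comp), which avoids
  // moving SSTable data over the network; DComp only has the compute-side
  // DPU available.
  if (!storage_dpu_available) {
    return Target::ComputeDpu;
  }
  // When the DPU is overloaded, newly generated jobs fall back to the
  // compute node (see the scalability discussion below).
  return dpu_overloaded ? Target::ComputeDpu : Target::StorageDpu;
}
\end{verbatim}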
Write throughput. Figure 12(a) shows the scalability of random write throughput as the number of background compaction threads increases. The benefit of hardware-assisted compression on a single compaction largely carries over to this write-intensive workload: it brings a \(2.98\times\) to \(3.95\times\) throughput improvement over Baseline across different numbers of compaction threads. In our tests, Baseline's performance stops increasing once the number of compaction threads reaches 8. This is because adding CPU threads only reduces CPU competition among concurrent compaction jobs at different levels; once CPU resources are sufficient to cope with bursts of compaction jobs (e.g., each compaction job runs on a different CPU core), adding more threads brings no further benefit. The ultimate performance is bounded by the execution speed of a single compaction job on a single CPU core, which is limited by data-intensive compression. With our HC, this limitation is removed, and overall performance improves significantly. Nevertheless, naively offloading all compaction jobs to the DPU (i.e., Naive-Off+HC) reduces the performance gains because the other computational steps are slowed down by the slow Arm cores.
DComp and D\({^2}\)Comp overcome this problem by leaving the performance-critical \(L_0\)-\(L_1\) compaction jobs on the fast host cores, which retains most of the performance gains at the cost of a small amount of host CPU consumption and network traffic. Baseline+HC slightly outperforms D\({^2}\)Comp because it performs all compaction jobs on fast host cores. However, this approach consumes a large amount of host CPU resources and causes significant network amplification, as shown in Figure 12(c).
Scalability. Figure 12(b) further shows the scalability of the different comparison groups with respect to the number of RocksDB instances; the y-axis reports the average write throughput across all instances, and each instance is configured with 16 compaction threads. Baseline and Baseline+HC maintain a steady average throughput as the number of instances increases, since there are enough host CPU cores for compaction. However, executing all compaction jobs on the host can cause severe CPU contention with the foreground workload and incurs network traffic. Naive-Off+HC eliminates the host CPU footprint and reduces network traffic by offloading all compaction jobs to the DPUs, but its average throughput decreases gradually as the number of instances grows, mainly because the compaction jobs of multiple instances compete for the limited number of Arm cores on the DPU. In contrast, D\({^2}\)Comp and DComp maintain the performance gains by keeping performance-critical \(L_0\)-\(L_1\) compaction jobs on the host CPU and dynamically offloading \(L_2\)-\(L_n\) compaction jobs to the DPUs. In this test, we also observe that even with four instances, each running 16 concurrent compaction threads, D\({^2}\)Comp still maintains a high write throughput. This demonstrates that the DPU can handle a large number of concurrent \(L_2\)-\(L_n\) compaction jobs with little impact on write performance. In addition, when higher write loads overload the DPU, newly generated tasks are scheduled to run on the compute node, thus preserving the performance benefits.
Network amplification. Figure 12(c) shows the network amplification of the different comparison groups when randomly loading datasets of different sizes with 8 compaction threads. The network amplification of Baseline, Baseline+HC, and DComp grows as the amount of written data increases, because more and more compaction jobs need to read data from the storage node and write it back. Naively offloading all compaction jobs to the storage-side DPU (i.e., Naive-Off+HC) leads to even higher network amplification, almost \(2\times\) that of Baseline for the 500GB dataset, because these jobs still need to fetch files through the NFS service on the host CPU, resulting in redundant data movement. In contrast, D\({^2}\)Comp eliminates these data movements by allowing the DPU to read and write files from the SSD directly, and it reduces network traffic by offloading the \(L_{2}\)-\(L_{n}\) compactions to the storage node's DPU.
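Here, network amplification refers to the ratio of the total data transferred over the network to the user data loaded; this formulation of the metric is an assumption based on the measurements above:
\[
\text{Network amplification} \;=\; \frac{\text{bytes transferred over the network}}{\text{bytes of user data written}}.
\]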
Read throughput. For read-only workloads, compaction is not triggered, so the offloading policy has no impact on their performance; their performance is mainly affected by HC. Our tests show that D\({^2}\)Comp increases sequential and random read throughput by \(1.13\times\) and \(1.28\times\), respectively, compared to Baseline. This improvement mainly comes from the hardware acceleration of the decompression step of read operations. Overall, this experiment demonstrates that decompression also influences the read performance of the LSM-tree on fast storage devices; D\({^2}\)Comp reduces this overhead by offloading it to the accelerator, thus benefiting read performance.
Latency. Table 1 presents the latency of the fillrandom and readrandom benchmarks. D\({^2}\)Comp effectively reduces the average latency of write and read operations because it speeds up the (de)compression steps of compaction and of read operations. The reduction in the p99 latency of write operations is mainly due to reduced write stalls, since the \(L_{0}\)-\(L_{1}\) compaction is sped up [6].
4.5 YCSB Benchmarks
Now we compare the CPU-only Baseline and D\({^2}\)Comp using the YCSB benchmarks. We set the background compaction thread pool to 8. Figure 13 reports the YCSB results using the following workloads: workload a (50% read and 50% update), workload b (95% read and 5% update), workload c (read-only), workload d (95% read and 5% insert), workload e (95% scan and 5% insert), and workload f (50% read-modify-write on latest records and 50% random reads). In these tests, our proposal improves throughput and reduces host CPU utilization. Reads dominate all these workloads, resulting in infrequent compactions in the underlying KV store; hence, the throughput improvement is not as high as in the write-intensive workloads. We observe 30.8%, 20.2%, 15.6%, and 23.0% throughput gains for workload a, workload b, workload d, and workload f, respectively, all of which contain non-trivial writes. For the read-only workload c and the scan-intensive workload e, our proposal also gets 47.0% and 22.1% throughput gains, respectively, due to the hardware-assisted acceleration of the decompression step during reads. Since all these workloads are read-driven, RD is rarely triggered, and the improvement mainly comes from HC. In addition, our proposal reduces tail latency under all these workloads by up to 16.9%. Overall, D\({^2}\)Comp also benefits the macro-benchmarks.
4.6 The Impact of Configurations
In this section, we investigate the impact of turning off compression and of using a hierarchical compression strategy [34], i.e., no compression for the top two levels, fast lz4 compression for the middle two levels, and slow zlib compression for the last level. We set the number of compaction threads to 16 and evaluate the overall throughput of the various compression strategies when the host CPU resources are sufficient (16 cores for compaction) and insufficient (1 core for compaction).
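The hierarchical strategy can be expressed with RocksDB's per-level compression option; the C++ sketch below is only illustrative, and the assumed level count (five levels, \(L_0\)-\(L_4\)) is not taken from the text.
\begin{verbatim}
#include <rocksdb/options.h>

// Sketch of the TierCompress configuration described above: no compression
// for the top two levels, lz4 for the middle two, and zlib for the last
// level. A five-level tree (L0-L4) is assumed for illustration.
rocksdb::Options MakeTierCompressOptions() {
  rocksdb::Options opts;
  opts.num_levels = 5;  // assumed level count
  opts.compression_per_level = {
      rocksdb::kNoCompression,    // L0
      rocksdb::kNoCompression,    // L1
      rocksdb::kLZ4Compression,   // L2
      rocksdb::kLZ4Compression,   // L3
      rocksdb::kZlibCompression   // L4 (last level)
  };
  return opts;
}
\end{verbatim}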
Table 2 shows the results. Turning off compression improves overall performance because it removes the computational overhead of compression. However, this comes at the cost of increased storage consumption, with a space footprint of 77.34GB. Given the high cost of fast NVMe SSDs, reducing the storage footprint is necessary [11, 17, 23]. Using a hierarchical compression strategy reduces the storage footprint to 42.33GB with only a small performance degradation, since the multi-threaded performance is mainly bounded by the \(L_{0}\)-\(L_{1}\) compaction. However, both the NoCompress and TierCompress strategies suffer a significant performance reduction with limited host cores. This is because operations such as data reordering and merging still incur high computational overhead even when compression is turned off; they consume significant CPU resources and cause resource contention. TierCompress drops more than NoCompress because it enables compression for the \(L_{2}\)-\(L_{n}\) levels, further exacerbating the CPU load. D\({^2}\)Comp mitigates this slowdown by offloading overloaded compaction jobs to the DPUs, reducing host CPU contention and maintaining overall throughput. Overall, D\({^2}\)Comp still benefits the write performance of the LSM-tree even when compression is turned off, since it effectively mitigates host CPU contention.