This section presents experimental results for the Yahoo! Cloud Serving Benchmark (YCSB) [
11] and TableFS [
32,
34], a real-world KV application that relies on range queries. We compare KVRangeDB against two other solutions: Wisckey [23], the state-of-the-art software KV-store on block devices, and RocksDB [14], the industry counterpart, ported to KVSSD (denoted RocksKV in our results). We analyze how each optimization technique presented here contributes to the overall performance improvement and how it affects different mixes of KV operations.
4.2 Results for YCSB
We use two datasets for the YCSB experiments: the first dataset of 250 million large records (16-B keys and 4000-B values) does not leverage packing; the second dataset of 1 billion small records (16-B keys and 1000-B values [38]) can leverage packing (we pack four logical records into one physical record). For all of our experiments, we first load all the data onto the device (the index is written with the data). We then run different query workloads to examine the performance of KVRangeDB, RocksKV, and Wisckey. For KVRangeDB, the Bloom filter described in Section 3.3 is constructed during the loading phase and persisted to the KVSSD when the database is closed. However, we bypass the Bloom filter check in the get workload, since the dataset is either fully packed or fully unpacked.
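To make the packing configuration concrete, the following is a minimal C++ sketch of how four small logical records could be combined into one physical record; the header layout with per-record keys, offsets, and lengths is our own simplification for illustration and not necessarily the exact on-device format used by KVRangeDB.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified packing sketch: several logical records are concatenated into
// one physical record. The header stores each logical key plus the offset
// and length of its value, so a single device read can serve any of them.
struct LogicalRecord {
  std::string key;    // 16-B application key
  std::string value;  // ~1000-B value
};

std::string PackRecords(const std::vector<LogicalRecord>& recs) {
  std::string blob;
  uint32_t n = static_cast<uint32_t>(recs.size());
  blob.append(reinterpret_cast<const char*>(&n), sizeof(n));
  uint32_t offset = 0;
  for (const auto& r : recs) {  // header: key, value offset, value length
    uint32_t klen = static_cast<uint32_t>(r.key.size());
    uint32_t vlen = static_cast<uint32_t>(r.value.size());
    blob.append(reinterpret_cast<const char*>(&klen), sizeof(klen));
    blob.append(r.key);
    blob.append(reinterpret_cast<const char*>(&offset), sizeof(offset));
    blob.append(reinterpret_cast<const char*>(&vlen), sizeof(vlen));
    offset += vlen;
  }
  for (const auto& r : recs) blob.append(r.value);  // concatenated payload
  return blob;  // stored under a single physical key on the KVSSD
}
```

Under this assumption, one device put of the packed blob replaces four separate puts, which is the source of the write-throughput gain discussed below.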
Write performance. Figure
8(a) demonstrates the throughput of loading data onto the device. For smaller records, packing can improve the overall write throughput and reduce the number of keys managed by the device, as we discussed in Section
3.2. The loading throughput of KVR-PK outperforms RocksKV by 14
\(\times\) and Wisckey by 1.3
\(\times\). It is also worth noting that RocksKV incurs greater compaction I/O, since it packs keys and values together. Packing more records into a physical record yields higher write throughput and thus improves data loading efficiency. KVR-PK is beneficial for write-heavy use cases with many small records. For the 4000-B value size, KVR achieves 18.8\(\times\) better performance than RocksKV. KVR performs slightly (\(\sim\)15%) worse than Wisckey on writes, as Wisckey leverages large sequential I/O for writes. However, Wisckey’s implementation suffers on removes and updates (which require host-side garbage collection), unlike KVRangeDB, which can directly remove and update records on the device through the user key. We evaluate remove performance as part of the file system workloads in Section 4.3.
Point query. For RocksKV, a get operation requires examining several sorted runs in each level of the LSM-tree to finally retrieve the record, introducing multiple I/Os. Wisckey needs to look up the LSM-tree for the log offset of a record based on the user key before retrieving the value from the log. In contrast, KVRangeDB without packing (KVR) can fulfill the get request with a single I/O using the user key through the KV interface provided by the device. Similar to Wisckey, KVR-PK only requires traversing a small LSM-tree to translate the logical key to the physical key and then retrieves the value from the device using the physical key. Hence, a small index cache is enough to reduce the I/O overhead of index lookups.
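The difference in per-get I/O counts can be summarized with the following sketch; the in-memory maps stand in for the KVSSD and for the small translation index, so the names and interfaces here are illustrative assumptions rather than the actual APIs.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Mock stand-ins for the device KV interface and the small host-side index;
// the real KVSSD API and KVRangeDB internals differ.
std::unordered_map<std::string, std::string> device;     // physical key -> record
std::unordered_map<std::string, std::string> key_index;  // logical -> physical key

// KVR (no packing): the application key is the physical key, so a point
// query is one device I/O.
std::optional<std::string> KvrGet(const std::string& user_key) {
  auto it = device.find(user_key);
  if (it == device.end()) return std::nullopt;
  return it->second;
}

// KVR-PK: translate the logical key through the small (mostly cached) index,
// fetch the packed record with one device I/O, then extract the value.
std::optional<std::string> KvrPkGet(const std::string& user_key) {
  auto idx = key_index.find(user_key);   // in-memory index lookup
  if (idx == key_index.end()) return std::nullopt;
  auto rec = device.find(idx->second);   // single device I/O
  if (rec == device.end()) return std::nullopt;
  // Extracting the value of user_key from the packed record (via its header)
  // is omitted here for brevity.
  return rec->second;
}
```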
Figure
9(a) and (b) demonstrate the performance of the simple get (or point query) workload. KVR exhibits a large advantage over RocksKV for both the no-cache and 1-GB cache scenarios. KVR outperforms Wisckey for large records by 73% (no cache) and 39% (1-GB cache). KVR-PK provides slightly lower performance than Wisckey with the 1000-B value size, because the block device provides better read performance than the KVSSD.
Scan keys. For the scan key workload, KVRangeDB only needs to traverse a relatively small LSM-tree containing only keys. By contrast, RocksKV’s LSM-tree comprises both keys and values, which may require more I/Os. KVR-PK/KVR achieve much better performance, \(\sim 8\times\) better than RocksKV with a 1-GB cache, as shown in Figure 10(a) and (b). KVR-PK/KVR perform slightly worse than Wisckey due to the read performance disadvantage of the KVSSD (Wisckey also needs only a single I/O to retrieve a value after locating the log offset).
Some may wonder if scanning the keys only (without retrieving values) makes sense in real-world applications. Here is an example of a typical file system workload (more details in Section
4.3): Consider the command line utility
ls, which lists files and sub-directories. In TableFS, an ls -l $path command translates to a scan on the target directory that needs to retrieve values (calling both key() and value()) to parse the stats in the inode. However, a simple ls $path command only needs to iterate over the keys without reading the values (inode information).
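The distinction between the two commands maps directly onto the iterator interface. The sketch below assumes a RocksDB-style iterator with key()/value() accessors backed by an in-memory mock; it only illustrates which calls each command needs, not KVRangeDB's actual implementation, where value() may trigger a separate device I/O while key() is served from the key-only LSM-tree.

```cpp
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Minimal RocksDB-style iterator over an in-memory vector (mock only).
struct Iterator {
  std::vector<std::pair<std::string, std::string>> entries;
  size_t pos = 0;
  bool Valid() const { return pos < entries.size(); }
  void Next() { ++pos; }
  const std::string& key() const { return entries[pos].first; }
  const std::string& value() const { return entries[pos].second; }  // extra I/O in KVRangeDB
};

// "ls $path": list entry names only -- key() per entry, no value retrieval.
void ListNames(Iterator& it) {
  for (; it.Valid(); it.Next()) std::cout << it.key() << "\n";
}

// "ls -l $path": also parse inode stats -- key() and value() per entry.
void ListWithStats(Iterator& it) {
  for (; it.Valid(); it.Next())
    std::cout << it.key() << " " << it.value().size() << " bytes of inode\n";
}
```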
Scan keys and values. On the flip side, KVRangeDB does not perform equally well with range queries that retrieve values, since it costs a separate I/O for each
value() operation. As shown in Figure
10(c) and (d), when the scan length exceeds 40, KVR-PK-PF/KVR-PF perform worse than RocksKV. The value prefetch optimization with user hints improves performance to some extent (
\(\sim\)56%). From the analysis of real key–value workloads [
41], the average scan length is less than 20. Therefore, it may not be worth packing keys and values together as RocksKV does, since such packing mostly benefits longer scans with value retrieval (
value() operation).
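As a rough illustration of how a scan-length hint can be exploited, consider the sketch below: the caller states how many values it expects to read, so the store can launch those value reads up front and overlap them, instead of paying one synchronous round trip per value() call. The hint parameter, helper names, and the in-memory mock are assumptions for illustration only, not KVRangeDB's actual interface.

```cpp
#include <future>
#include <map>
#include <string>
#include <vector>

// Mock device: physical key -> value; a real KVSSD read would be an async I/O.
std::map<std::string, std::string> device;

std::vector<std::string> ScanWithPrefetch(const std::string& start_key,
                                          int expected_scan_length) {
  // Step 1: walk the (cheap) key-only index to find the keys in range.
  std::vector<std::string> keys;
  for (auto it = device.lower_bound(start_key);
       it != device.end() && (int)keys.size() < expected_scan_length; ++it)
    keys.push_back(it->first);

  // Step 2: issue all value reads asynchronously (the "prefetch").
  std::vector<std::future<std::string>> pending;
  for (const auto& k : keys)
    pending.push_back(std::async(std::launch::async,
                                 [k] { return device.at(k); }));

  // Step 3: consume values; ideally the reads are already in flight or done.
  std::vector<std::string> values;
  for (auto& f : pending) values.push_back(f.get());
  return values;
}
```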
4.3 Results for TableFS
For the file system workloads [
2,
33,
37,
40], we use a real file system trace from Los Alamos National Lab that contains approximately 500 million files and directories (
\(\sim\)20 million directories and
\(\sim\)480 million files), and
\(\sim\)90% of the files are marked as “cold,” which can leverage our hybrid packing technique described in Section 3.2. The loading and aging phases consist of multiple file operations such as path resolution, opendir, mkdir, mknod, unlink, chmod, and so on, which translate into a combination of
put,
get,
and delete workloads on the KV-store. At the end of each aging round, we perform value log garbage collection for Wisckey (around a 25% difference between the real metadata capacity and the actual storage usage). Value prefetching is enabled for range queries for both Wisckey and the KVRangeDB variants.
For KVR-PK-PF, we selectively pack multiple file inode records (those marked as part of the cold set) under the same directory into a single physical record, as described in Section
3.2. Since the files in the same directory are loaded together, such packing can benefit range queries as discussed in Section
3.2. For the remaining
\(\sim\)10% of files, which are hot, we do not perform packing, and the values (inode information) can be directly retrieved from the device through the logical/application keys.
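A simplified sketch of this hot/cold split is shown below; the in-memory maps, the cold flag, and the physical-key naming are illustrative assumptions, and the real implementation packs records with a proper header as described in Section 3.2.

```cpp
#include <map>
#include <string>
#include <vector>

// Hybrid-packing sketch for TableFS metadata: cold inodes in the same
// directory are grouped into one physical record; hot inodes are stored
// individually under their application key.
std::map<std::string, std::string> device;     // physical key -> record
std::map<std::string, std::string> key_index;  // logical -> physical key

struct Inode { std::string key; std::string attrs; bool cold; };

void StoreDirectory(const std::string& dir, const std::vector<Inode>& inodes) {
  std::string pack;                        // packed record for the cold set
  std::string pack_key = dir + "/#pack0";  // hypothetical physical key
  for (const auto& ino : inodes) {
    if (ino.cold) {
      key_index[ino.key] = pack_key;       // point queries must translate the key
      pack += ino.attrs;                   // (header with offsets omitted)
    } else {
      device[ino.key] = ino.attrs;         // hot: direct get by application key
    }
  }
  if (!pack.empty()) device[pack_key] = pack;
}
```

Point queries on cold inodes then go through the index translation, while hot inodes are fetched directly, which matches the behavior observed in the lstat workload below.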
Load file system tree. Figure
11 presents the results of loading the file system tree into TableFS. KVR-PK-PF yields a 33.9
\(\times\) speedup over RocksKV and 1.14\(\times\) over Wisckey. In addition, KVR-PK-PF reduces CPU consumption by 15
\(\times\) and 1.5
\(\times\), respectively. We also collect the number of I/O requests and the read/write amplification from/to the device. RocksKV incurs significantly larger write amplification, 15.7\(\times\) worse than KVR-PK-PF, due to the constant compaction of sorted runs. KVR-PK-PF also reduces read amplification enormously, specifically over 2000\(\times\) lower than RocksKV and 14\(\times\) lower than Wisckey, thanks to the direct get interface on the device. KVR-PK-PF performs slightly worse than KVR-PF; however, it reduces CPU cost by 12% (due to fewer write I/Os).
Aging the file system. Figure
12 demonstrates the results of aging the TableFS file system tree. KVR-PK-PF outperforms RocksKV and Wisckey by 72
\(\times\) and 23.7
\(\times\), respectively. Moreover, KVR-PK-PF also saves CPU cost by 55.6
\(\times\) and 14.3
\(\times\), respectively. The main negative factor for Wisckey is the value log garbage collection caused by record updates [
15,
23]. Wisckey issues a larger number of read I/Os, because it needs to look up the key-to-log-offset mapping for every get operation (to check file path existence) and also performs garbage collection after removes and updates of records. KVR-PK-PF greatly reduces read and write amplification, by 385
\(\times\) and 9.8
\(\times\) compared to Wisckey. This advantage is mainly attributed to using the direct key-value interface of the KV device to store values, which effectively offloads value log garbage collection from the host to the device.
Metadata-intensive operations. Figure
13 shows the performance and read I/O results for metadata-intensive file system workloads. We limit the CPU resources (four and eight physical cores) to emulate the resource competition common in multi-tenant scenarios and assign 16/32 client threads per physical core.
Parallel
find workloads traverse the files/directories in a breadth-first-search fashion. These workloads contain path lookup and readdir operations that translate to
get and range queries. KVR-PK-PF yields
\(\sim 5.1\times\) better performance on average compared to RocksKV and reduces CPU cost by a factor of 3.9
\(\times\). This is because in a real file system directory tree, there are many directories with very few sub-directories and files (leading to short scans). Wisckey outperforms KVR-PK-PF by \(\sim\)30%, simply because the current block SSD has much higher read IOPS (
\(\sim 3\times\)) as shown in Table
2 and better latency characteristics [
15].
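For intuition, the traversal performed by each find worker can be sketched as a plain breadth-first walk in which every directory costs one (usually short) readdir range query plus point gets on the children's inode keys; the in-memory map below is only a stand-in for the TableFS namespace, not the benchmark harness.

```cpp
#include <deque>
#include <map>
#include <string>
#include <vector>

// Toy in-memory namespace standing in for TableFS: each directory key maps
// to its children. In the real workload, listing a directory is a range
// query on the key-only LSM-tree and each child check is a device get.
std::map<std::string, std::vector<std::string>> dirs;

// Breadth-first "find": one readdir scan per directory visited.
std::vector<std::string> Find(const std::string& root) {
  std::vector<std::string> visited;
  std::deque<std::string> queue{root};
  while (!queue.empty()) {
    std::string dir = queue.front();
    queue.pop_front();
    visited.push_back(dir);
    auto it = dirs.find(dir);             // readdir -> range query
    if (it == dirs.end()) continue;
    for (const auto& child : it->second)  // child lookup -> point get
      queue.push_back(child);
  }
  return visited;
}
```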
Parallel “ls -l” contains path lookup and readdir operations that translate to get and range queries with both key() and value() operations and various scan lengths (depending on the number of files and sub-directories within a queried directory). KVR-PK-PF yields \(\sim 5\times\) better performance on average compared to RocksKV and reduces CPU cost by 4\(\times\). Compared to KVR-PF, KVR-PK-PF slightly improves performance and reduces CPU consumption, since it reduces get I/O operations (\(\sim\)10%) when the queried directory’s file inodes are packed.
The parallel lstat workload consists of get operations only. Compared to RocksKV and Wisckey, which require multiple I/Os per get operation (RocksKV needs to examine multiple sorted runs or SSTable files, while Wisckey needs to look up the log offset from the user key before retrieving the value from the log), KVR requires only a single I/O per get through the KV device interface. Thus, KVR issues 15.9\(\times\) and 1.9\(\times\) fewer I/Os than RocksKV and Wisckey, respectively. Moreover, KVR outperforms RocksKV and Wisckey by 51\(\times\) and 1.12\(\times\) and reduces CPU usage by 30\(\times\) and 1.15\(\times\), respectively. The file system workloads showcase the advantages of KVR, even though the current KVSSD's read performance is relatively low compared to a block SSD built on similar hardware. Although KVR-PK-PF requires more than one I/O per get when keys need to be translated, its performance is barely affected under these workloads. To understand this, we analyzed the workload and found that most lstat operations are performed on the hot file set, whose keys do not need translation (the application key equals the physical key); thus, KVR-PK-PF performs similarly to KVR-PF.
For simple parallel “ls” without “-l,” which is converted to a range query without value() operations, KVR-PF performs 21\(\times\) better than RocksKV. The cause of RocksKV’s poor performance is that its SSTables pack keys and values together; thus, the cost of range queries calling only key() is similar to that of range queries calling both key() and value(). KVR-PF, KVR-PK-PF, and Wisckey have similar performance, since they separate keys from values.