
Why Files If You Have a DBMS?

Lam-Duy Nguyen Viktor Leis


Technische Universität München Technische Universität München
lamduy.nguyen@tum.de leis@in.tum.de

Abstract—Most Database Management Systems (DBMSs) support arbitrary-sized objects through the Binary Large OBjects (BLOBs) data type. Nevertheless, application developers usually store large objects in file systems and only manage the metadata and file paths through the DBMS. This combined approach has major downsides, including a lack of transactional and indexing capabilities. Two factors contribute to the rare use of database BLOBs: the inefficiency of DBMSs in such workloads and the interoperability difficulties when interacting with external programs that expect files. To address the former, we present a new BLOB allocation and logging design that exhibits lower write amplification, reduces WAL checkpointing frequency, and consumes less storage than the conventional strategies. Our approach flushes each BLOB only once and features only a single indirection layer. Moreover, using the Filesystem in Userspace framework, BLOBs can be exposed as read-only files, allowing unmodified applications to directly access database BLOBs. The experimental results show that our design outperforms both file systems and DBMSs in handling large objects.

Index Terms—Database management systems, Operating systems, File systems, User interfaces, Large objects, Strings

I. INTRODUCTION

Data management systems are ubiquitous. DBMSs are commonly used for addressing a wide and heterogeneous range of real-world data management problems, offering valuable features such as ACID transactions and declarative queries. Their performance has been significantly optimized through decades of dedicated research and engineering, making them the default choice for diverse data management needs.

Large objects are stored as files. One major exception to the dominance of DBMSs is large binary objects. Although most DBMSs support the Binary Large Object (BLOB) and arbitrary-length string (CLOB, VARCHAR, TEXT) data types, proprietary or specialized data such as audio, image, video, and document objects are usually stored as files rather than inside the database system. Imagine an application that manages medical X-ray images: most developers would probably store all structured application data (including the patient data, image metadata, and file paths) in a DBMS, but the image data in a file system.

Downsides of files. Storing large objects in a file system separately from the application data (which is maintained in a DBMS) has several downsides:

• Durability: File systems and DBMSs have separate and independent durability regimes (fsync vs. commit). Imagine a situation where a crash occurs during the insertion of a new X-ray image and its record: depending on whether fsync or commit is executed first, one may end up either with an X-ray scan without a patient record, or a patient record without its associated X-ray image.

• Transactions: File systems do not support transactions, making it difficult to perform multi-file operations correctly. Consider an administrator updating a web application, which modifies multiple configuration and resource files. Without atomic multi-file operations (i.e., transactions), incomplete updates may occur, e.g., configurations in config files may reference deprecated resource files, leading to software inconsistencies and instability.

• Indexing: File systems lack support for indexing file content or metadata, which is beneficial in many situations. For instance, indexing facilitates tasks such as deduplicating files based on file content or organizing files in ascending order by their last modification date.

• Performance: As we will show in Section V, accessing files can be slow due to system call overheads [1, 2].

Downsides of BLOBs. Given these problems, one may wonder why storing BLOBs in database systems is uncommon. We attribute this to two primary factors. First, database systems are not optimized for handling BLOBs¹, with file systems often proving more efficient. Second, BLOBs are often accessed not just by the storage systems, but also by external programs that require the input data in files. Consider, for example, a computer vision tool for classifying images, or a web server serving image data for a content management system. In both cases, the images will be expected as files. Having to copy BLOBs to the file system for interoperability with external programs would exacerbate many drawbacks of storing large objects within the DBMS.

¹ One notable exception is SQLite, which is specifically advertised for use cases that would normally rely on file systems [2, 25].

Contributions. As pointed out in a CIDR keynote by Hannes Mühleisen [26], there is little research on managing large binary objects and strings in DBMSs. This work aims to close this gap. First, we present techniques for efficiently managing BLOBs in database systems, showing that a DBMS can outperform state-of-the-art file systems. To achieve this, we propose a new BLOB physical storage format and logging scheme for DBMSs. We write every BLOB only once to the storage while ensuring its crash consistency, and use a single-layer indirection called Blob State to obtain the on-storage location of every BLOB. This differs from existing approaches that write every object twice [11, 14, 10, 9, 24, 23, 15, 19] and use multi-level indirection layers to store BLOBs [20, 11, 15, 6, 3, 5]. Second, we solve the interoperability issue with external programs using the Filesystem in Userspace (FUSE) interface. We expose large objects as read-only files, allowing external software to directly access DBMS-managed BLOBs without code modifications. Overall, with our approach, applications managing BLOBs gain all functional benefits of DBMSs (e.g., transactions and indexing) without sacrificing performance or complicating interoperability with external programs.
TABLE I
LARGE OBJECT IMPLEMENTATIONS IN THE EXT4 FILE SYSTEM AND SEVERAL WIDELY-USED DBMSS

System            Physical storage format            Max size   Read cost   Indexing - Prefix limit   Duplicated copies
Ext4 file system  Multi-level extent tree [3]        16TB [4]   High²       Not supported             Journal [5]⁴
PostgreSQL        TOAST relation [6]                 4TB [7]    Medium¹     8191 bytes [8]            WAL [9, 10]
SQLite            Linked-list of pages [11]          2GB [12]   High²       Arbitrary size [13]       WAL [11, 14] & Index [13]
SQL Server        Tree-like structure of pages [15]  2GB [16]   High²       Not supported [17, 18]    WAL [19]
MySQL/InnoDB      Linked-list of pages [20]          4GB [21]   High²       767 bytes [22]            DWB³ & Redo [23, 24]
Our design        Extent sequence                    10PB⁵      Low         Arbitrary size            None⁶

¹ Multiple lookup/scan to read a BLOB  ² Many indirection layers, I/O and computation interleave  ³ Double-Write Buffer
⁴ Mount with data=journal  ⁵ Theoretically 5.76 × 10^17 YB with 127 extents and 4KB pages, details in Section III  ⁶ Except BLOB update
II. BACKGROUND

A. Limitations of Existing Approaches

In the following, we describe how PostgreSQL, SQLite, Microsoft SQL Server, and MySQL/InnoDB manage BLOBs, and contrast them with Ext4, the default file system of Linux.

Ext4: Hierarchical extent tree. The standard file system in Linux, Ext4, maintains files in a multi-layer structure [3], as the following figure shows:

[Figure: Ext4 extent tree — the inode's i_block field holds a root node (extent header plus extent indexes); ext4_extent_idx entries point to interior nodes, whose ext4_extent leaf entries reference the data blocks.]

Ext4 builds an extent tree structure for every file larger than 512MB. The extent tree helps translate the file's logical addresses to the corresponding physical blocks [3, 27]. It is complex (for good reasons, which we will discuss in Section III). However, the extent tree also has some limitations; one particular issue is the tree traversal overhead. That is, accessing data in Ext4 requires navigating through several layers of the extent tree, mixing I/O and computation, which may reduce performance.

Inefficient storage format in DBMSs. The surveyed DBMSs utilize either an auxiliary structure [20, 11, 15] or a relation [6] to manage the BLOB chunks. In the first approach, used in MySQL, SQLite, and SQL Server, the DBMSs store every BLOB in multiple overflow pages, which are chained together using a linked list or a tree. Consequently, queries will access the overflow pages sequentially one after another [20, 11, 15], resulting in I/O interleaved with computation and thus higher query latency. The second method, the Oversized-Attribute Storage Technique (TOAST) implemented by PostgreSQL [6], does not force I/O and computation to interleave. It organizes the BLOB chunks (and metadata) in a separate "TOAST" table. Consequently, every BLOB read involves two relation lookups (the main relation and the TOAST table) in addition to one scan to read all chunks. Because every TOAST page contains only four chunks by default [6], read operations must scan through multiple database pages to retrieve the BLOB content. In summary, these indirections significantly contribute to why accessing BLOBs is not always efficient.

Excessive BLOB writes. To ensure BLOB integrity, DBMSs write every entry at least twice to storage, both to the database and to the log [11, 14, 10, 9, 24, 23, 15, 19]. This design has two consequences. First, it increases the log size and thus triggers WAL checkpointing more frequently, which slows down database operations [28, 29]. Second, it increases write amplification excessively, reducing the longevity and performance of the storage device if the DBMS runs on top of an NVMe SSD [30, 31, 32]. When mounted with the data=journal option, the Ext4 file system behaves similarly, with the content of a new file also being written to the journal [3, 5].

Unnecessary BLOB copies. All systems maintain at least two copies of every BLOB, one in the database and one in the log. SQLite is the worst in terms of storage consumption because it includes whole BLOBs in the WITHOUT-ROWID index, and it also logs the BLOB content from both the database and the index [14]. In total, SQLite creates at least four copies per BLOB if both WAL and the WITHOUT-ROWID index are enabled.

BLOB indexing limitations. Amongst the surveyed systems, only SQLite supports full BLOB indexing. However, SQLite doubles the content of those objects, storing them in both the main relation and the BLOB index (WITHOUT-ROWID index [13]), and is thus not recommended if the object size is huge [13]. PostgreSQL and MySQL/InnoDB only index BLOB prefixes [22, 8], while SQL Server disallows indexing BLOB data altogether [17, 18].

Summary. Table I summarizes the existing approaches and contrasts them with our solution. We ensure BLOB durability without writing a BLOB more than once. Additionally, our single-layer BLOB storage format is lightweight, simplifying BLOB operations. Finally, we support BLOB indexing like SQLite but require no BLOB copy, saving storage consumption.
III. LARGE OBJECT LIFE-CYCLE

In this work, we assume that the DBMS runs on an NVMe SSD with a buffer cache that supports fixed-size pages. In most DBMSs, the page size is usually 4-64 KB. The buffer manager heavily relies on page translation, which maps a page identifier (PID) to an in-memory pointer that refers to the page content. We refer to those in-memory pointers as buffer frames.

A. Extent Management

Existing approaches in DBMSs are ineffective. Current DBMSs implement either an auxiliary index structure to manage overflow pages or a system relation to manage BLOB chunks. Despite being simple, these approaches have many limitations, as described in Section II. This leads to an intriguing question: what if we implement the extent tree in DBMSs specifically for BLOB management?

Why extent tree? There are several reasons behind the extent tree. First, file systems should be efficient even in obscure scenarios, including the hole-punching operation that deletes middle extents and reclaims their space. Second, file systems use a best-effort approach to allocate new extents by seeking the largest free space available. Altogether, file systems store a file as an arbitrary number of extents of arbitrary size, which requires the extent tree to manage them effectively.

Are those requirements avoidable in DBMSs? We believe the answer is yes because typical applications either generate static objects, store multiple versions of objects, or replace objects completely [19]. The operations in such scenarios – create/replace, read, and delete – do not interact with middle extents. Analogously, Amazon S3, a widely-used object storage system, also restricts user interactions to entire BLOBs, disallowing partial updates and removals [33].

Extent sequence. We suggest storing BLOBs as a flat list of extents (an extent is a contiguous range of physical pages), termed an extent sequence. By enforcing exponential growth on this list – ensuring subsequent extents are always larger than previous ones – we can limit the size of this list while supporting huge BLOBs. In other words, the list of extents is small but still represents any arbitrarily sized BLOB.

Reducing BLOB metadata. The metadata necessary for BLOBs comprises the extent offset (the PID of the head page) and the extent size (number of pages). We propose replacing the extent size metadata with a table, which determines the extent size from the static extent position, thus halving the size of the BLOB metadata. We call this table the extent tier.

Extent tier: Constraints & goals. A good tier table is crucial to how the system manages large objects. That is, it affects the maximum BLOB size, the simplicity and efficiency of BLOB operations, the amount of BLOB metadata, and storage utilization. Existing formulas such as Power-of-Two and Fibonacci are not suitable because of their high space consumption [34], i.e., 50% wasted space for Power-of-Two and 38.2% for Fibonacci, hence a new formula is required.

Extent tier: Proposed formula. Instead, we propose a new formula that utilizes storage more effectively. First, we logically split the tiers into multiple levels, and each level has the same number of tiers, e.g., if the system has 10 levels and each level comprises 10 tiers, then the total number of tiers is 100. Any tier after this has the same size as the largest tier. For any tier, given its level and its position within that level (both counters start at 0), the size of that tier (in pages) is:

    size = (level + 1)^(no_tiers_per_level − position) × (level + 2)^position

With 10 tiers per level, the first two levels (20 tiers) are:

    Level 0: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512
    Level 1: 1k, 1.5k, 2.3k, 3.5k, 5.2k, 7.8k, 11.7k, 17.5k, 26.2k, 39.4k

Assuming a 4KB page size, an extent sequence of 127 extents following this configuration can store a BLOB of up to 10PB.
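For illustration, the tier-size computation can be expressed as a small helper function (a C++ sketch; the constant and function names are ours, and the cap at the largest configured tier is omitted):

#include <cstdint>

constexpr uint64_t TIERS_PER_LEVEL = 10;

// Integer power helper.
constexpr uint64_t IntPow(uint64_t base, uint64_t exp) {
  uint64_t result = 1;
  for (uint64_t i = 0; i < exp; i++) { result *= base; }
  return result;
}

// Size (in pages) of the tier at the given global index, where
// level = index / TIERS_PER_LEVEL and position = index % TIERS_PER_LEVEL.
constexpr uint64_t TierSizeInPages(uint64_t tier_index) {
  const uint64_t level    = tier_index / TIERS_PER_LEVEL;
  const uint64_t position = tier_index % TIERS_PER_LEVEL;
  return IntPow(level + 1, TIERS_PER_LEVEL - position) * IntPow(level + 2, position);
}

// With 10 tiers per level this reproduces the table above:
static_assert(TierSizeInPages(0)  == 1);      // level 0, position 0
static_assert(TierSizeInPages(9)  == 512);    // level 0, position 9
static_assert(TierSizeInPages(10) == 1024);   // level 1, position 0 (1k)
static_assert(TierSizeInPages(19) == 39366);  // level 1, position 9 (39.4k)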
Balancing storage utilization and max size. This formula improves storage utilization compared to Power-of-Two and Fibonacci. For instance, given a 4KB page size and five tiers per level, the wasted space for a 20MB BLOB is 25%. This number decreases as the BLOB size increases, dropping to 7.3% when the BLOB is 51GB. However, a 127-extent sequence only supports a BLOB of up to 246GB with this setting. Increasing the tier count per level allows larger BLOBs at the cost of lower storage utilization. With 30 tiers per level, the first level already supports a 4TB BLOB, and the storage utilization of a BLOB fitting 120 extents is 20%, which is still better than both Power-of-Two and Fibonacci.

Tail extent: Arbitrarily-sized extent. For static BLOBs, the last extent may contain unused space, resulting in internal fragmentation. To prevent that, we allocate exactly one arbitrarily-sized extent, termed the tail extent, to replace the last extent. For instance, in the example illustrated in Figure 1, the normal strategy (Figure 1(a)) allocates three extents, and the last one has one empty page. With a tail extent, as Figure 1(b) depicts, the DBMS allocates only two extents to store "Foo and Bar" and stores the rest in three consecutive pages.

[Fig. 1. A BLOB of 6 pages ("Foo and Bar and extra Bar") and two possible Blob States: (a) a normal Blob State of 6 pages with extents starting at P4, P10, and P15; (b) a Blob State of 6 pages with extents P4 and P10 plus a tail extent of 3 pages starting at P15.]

B. Blob State

Format. We bundle all BLOB metadata into a single structure named Blob State. Every Blob State refers to only one BLOB. Specifically, the Blob State comprises the following properties.

• Size: Size of the referred BLOB.

• SHA-256: The computed SHA-256 of the BLOB, used for BLOB durability & indexing.

• SHA-256 intermediate digest: The 32-byte intermediate SHA-256 hash state (i.e., before the last 512 bits of the BLOB and padding), used for BLOB growth operations.
• Prefix: First 32 bytes of the BLOB. We will explain the usage and motivations behind the Prefix and SHA-256 for BLOB indexing in Section III-F.

• Tail Extent: A pair of a Page ID and the number of pages. Only populated if the BLOB has a tail extent.

• Number of Extents: The number of extents (excluding the tail extent) used to store the content of the BLOB.

• An Array of Head Page PIDs: A dynamic array of Page IDs, each of which refers to the head page (first page) of an extent. By combining this array with the extent tier, the system can determine the physical address of all extents.
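For illustration, the fields listed above map to a structure roughly like the following C++ sketch (field names, types, and the fixed-capacity array are ours; the actual Blob State is a variable-length structure stored inline with the tuple):

#include <array>
#include <cstdint>

using PID = uint64_t;                      // page identifier
constexpr uint32_t MAX_EXTENTS = 127;      // illustrative upper bound

struct BlobState {
  uint64_t blob_size;                      // Size of the referred BLOB in bytes
  std::array<uint8_t, 32> sha256;          // SHA-256 of the whole BLOB (durability & indexing)
  std::array<uint8_t, 32> sha256_partial;  // intermediate digest, resumed on BLOB growth
  std::array<uint8_t, 32> prefix;          // first 32 bytes of the BLOB (cheap range checks)
  PID      tail_extent_pid;                // head page of the tail extent (if any)
  uint32_t tail_extent_pages;              // number of pages in the tail extent, 0 if absent
  uint32_t extent_count;                   // number of extents, excluding the tail extent
  // Head page of every extent; the extent sizes are implied by the extent
  // tier table, so only the start PIDs need to be stored.
  std::array<PID, MAX_EXTENTS> extent_heads;
};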
Physical BLOB size. As explained earlier, with the extent sequence and extent tier, a small number of extents can represent a huge BLOB. Therefore, the flexible array of head page PIDs is not necessarily long, allowing the Blob State to be small while still referencing an arbitrarily-sized BLOB. For example, with 8 tiers per level, a Blob State of 801 bytes can refer to a BLOB of more than 16TB – the maximum file size that Ext4 supports [4].

Example. Figure 1 illustrates a Blob State for a 6-page BLOB. If the DBMS allocates the BLOB normally (Figure 1(a)), the Blob State will contain three extents: P4, P[10..11], P[15..18]. If the BLOB contains a tail extent (Figure 1(b)), the Blob State will only have two extents: P4 and P[10..11], and the tail extent starts at P15, spanning three pages.

Where to store Blob State. The DBMS should physically store the Blob State with the tuple for the BLOB column. Consider a sample relation with an Integer primary key and a BLOB column. Every row of this relation should store the Blob State for its associated BLOB column. All BLOB accesses will first query this relation for the Blob State and then load all the extents using the retrieved Blob State.

C. Durability

[Fig. 2. Traditional design in popular DBMSs (a): overflow pages + BLOB physical logging (the BLOB content is copied into the WAL buffer, flushed upon commit, and lazily evicted pages are written again), vs. our proposed approach (b): extent sequence + Blob State + asynchronous BLOB logging (only the small Blob State enters the WAL buffer).]

Redundant BLOB writes in conventional logging. Figure 2(a) depicts the BLOB allocation and logging approach deployed in major DBMSs. In this approach, the DBMSs break each BLOB into multiple chunks and store the BLOB chunks on random pages. After the allocation, all the BLOB parts are copied to the WAL buffer, which is flushed to non-volatile storage later. All these BLOB chunks will also be written out to storage later during the buffer eviction process. That means every BLOB is written to storage at least twice.

Asynchronous BLOB logging. Our approach neither appends BLOB content to the WAL nor writes the BLOB during buffer eviction. Instead, we write all BLOB chunks during the transaction commit. Figure 2(b) shows our design. First, the DBMS reserves the smallest extent sequence that can store the new BLOB, i.e., two extents for a three-page BLOB in the example. After that, the system creates a Blob State, stores this Blob State in the corresponding relation, and then appends it to the WAL buffer. Upon transaction commit, the DBMS triggers multiple asynchronous I/O requests to flush the WAL buffer (which contains the Blob State) and the extent sequence. Note that the DBMS only writes the dirty pages to storage, e.g., only P[15..17] of the 3rd extent in Figure 1.

BLOB recoverability. To ensure recoverability, the DBMS must guarantee that the Blob State is durable before writing the extents. If the BLOB were flushed before the Blob State is durable and the DBMS then crashed, the extents would be lost and unusable, leaving unusable holes within the DBMS. Therefore, we write and fsync() the WAL buffer (which contains the Blob State) before writing the extents. When a crash happens between the two events, during the Analysis phase of the recovery process, we can use the SHA-256 checksum to validate the BLOB content. If the BLOB content is faulty, the transaction committing that BLOB is considered failed and added to the UNDO transaction list.

BLOB eviction. Conventional methods additionally write all BLOBs during eviction because all pages storing BLOB content are marked dirty after allocation. In contrast, we flush all BLOBs at transaction commit, so all BLOB extents² are clean, eliminating the extra write. However, before the flush completes, concurrent transactions may evict one of the extents, causing the buffer pool to drop or replace the corresponding frame(s) with other page(s). We prevent that using an atomic prevent_evict flag per extent, set to true after allocation and reset to false once the extent flush is complete. The buffer manager does not evict extents with prevent_evict=true, avoiding undefined behavior.

² Our solution evicts/synchronizes BLOB accesses on extent granularity. We will discuss this in more detail in Section III-G.
D. Operations

BLOB read. To load a BLOB, the DBMS first looks up the BLOB relation to obtain the Blob State. Using this Blob State, the DBMS determines which extents are not in the buffer pool; assume these extents consist of N pages. Then, the DBMS allocates N buffer frames for all those extents and reads the extents using a single asynchronous I/O system call.

BLOB deletion and extent reusability. The extent tier design helps us reuse deleted extents efficiently. Because tiers are static, it is sufficient to manage a list of free extents per tier. During BLOB removal, the start PID of every extent is added to a temporary list. At transaction commit, the DBMS moves the free extents from the temporary list to the free lists according to the extent tier. Subsequent transactions can either pick a free extent from these free lists or allocate a fresh one.
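A minimal sketch of the per-tier free lists in C++ (names and the latching scheme are illustrative, not the system's actual implementation):

#include <cstdint>
#include <mutex>
#include <vector>

using PID = uint64_t;
constexpr int NUM_TIERS = 127;

class FreeExtentLists {
 public:
  // Called at transaction commit for every extent of a deleted BLOB.
  void Release(int tier, PID head_page) {
    std::lock_guard<std::mutex> guard(latches_[tier]);
    free_[tier].push_back(head_page);
  }
  // Called during BLOB allocation; returns true and sets head_page if a
  // previously freed extent of this tier can be reused.
  bool TryReuse(int tier, PID &head_page) {
    std::lock_guard<std::mutex> guard(latches_[tier]);
    if (free_[tier].empty()) { return false; }  // caller allocates a fresh extent
    head_page = free_[tier].back();
    free_[tier].pop_back();
    return true;
  }
 private:
  std::vector<PID> free_[NUM_TIERS];
  std::mutex latches_[NUM_TIERS];
};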
Growing a BLOB. In the example shown in Figure 3, we append a four-page chunk ("Bar and extra Bar") to a 2-page BLOB. Because the last extent lacks space to store the new content, the DBMS allocates more extents (one in the example), and then memcpy()s the new data to the available space. Afterward, the DBMS adds the two dirty extents to the to-flush list but only writes the dirty pages (P11 and P[15..17] in the example). Then, the DBMS re-calculates the SHA-256 signature by resuming the previous SHA calculation (based on the stored intermediate SHA digest) with the newly appended data, i.e., the preceding BLOB data is not loaded into the buffer pool. Finally, the DBMS updates the Blob State to reflect the latest details of the BLOB. For a BLOB with a tail extent, the DBMS can grow that object by cloning the tail into a new normal extent and following a similar procedure.

[Fig. 3. Appending new content to an existing BLOB. The old Blob State (2 pages) has extents starting at P4 and P10; after appending "Bar and extra Bar", the new Blob State (6 pages) has extents starting at P4, P10, and P15, and only the dirty pages P11 and P[15..17] are written.]
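The growth path above relies on resuming the SHA-256 computation instead of re-reading the old BLOB content. The following C++ sketch illustrates the idea with OpenSSL; as a simplification it keeps the whole SHA256_CTX as the resumable state, whereas the Blob State stores only the 32-byte internal digest:

#include <openssl/sha.h>
#include <cstddef>
#include <cstdint>

struct BlobDigest {
  SHA256_CTX running;                            // intermediate state, kept in the Blob State
  unsigned char digest[SHA256_DIGEST_LENGTH];    // digest of the whole BLOB
};

// Initial write of a BLOB.
void DigestInit(BlobDigest &d, const uint8_t *data, size_t size) {
  SHA256_Init(&d.running);
  SHA256_Update(&d.running, data, size);
  SHA256_CTX tmp = d.running;  // copy, so the running state stays resumable
  SHA256_Final(d.digest, &tmp);
}

// Append: only the new bytes are hashed; the old BLOB content is never re-read.
void DigestAppend(BlobDigest &d, const uint8_t *new_data, size_t size) {
  SHA256_Update(&d.running, new_data, size);
  SHA256_CTX tmp = d.running;
  SHA256_Final(d.digest, &tmp);
}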
Updating a BLOB. To update a BLOB, the DBMS determines the extent(s) that should be modified. After that, for each extent, the DBMS either (1) creates a delta log which contains the difference between the old and the new data, appends the log record to the WAL buffer, and then updates the extent in place, or (2) allocates a clone extent of the same tier, then updates this clone and the corresponding metadata in the Blob State. These two schemes are better in different situations, i.e., in the first scheme, new data is written twice, while the second writes old data one more time. Evaluating the cost of both schemes and selecting the better approach at runtime is straightforward. Nevertheless, because most applications primarily interact with entire BLOBs, we argue that writing data twice (either old or new data) in this scenario is acceptable.

E. Interoperability With File Systems

External apps mainly use files. One limitation of storing large objects in DBMSs is that such systems do not provide file APIs, which is the primary method for external programs to access BLOBs [35]. For example, computer vision libraries such as Tesseract OCR [36, 37] or OpenCV [38] work with image files instead of raw binary image data. One workaround is to copy the BLOBs into the file system, which may be expensive and would hide the efficiency of our design. Another way is to rely on an I/O wrapper over binary data similar to [39], yet it incurs extra complexity on external programs. Such a wrapper is also not ubiquitous across programming languages and thus cannot be deployed everywhere.

Filesystem in Userspace. One solution is to integrate with the Filesystem in Userspace (FUSE) [40, 41] framework to provide a file system interface for the DBMS. FUSE is the most popular framework on Unix OSes that allows non-privileged users to implement their file system in user space [41] without necessitating kernel code modifications. By integrating with FUSE, we can facilitate seamless interoperability between the DBMS and file systems, and applications can access their BLOBs in DBMSs without modifying the source code.

Relation as a directory. Consider a scenario where users want to store images within the DBMS, and now they want to expose those images as read-only files. All the images can be managed within the following relation:

CREATE TABLE image (
  filename VARCHAR PRIMARY KEY, content BLOB)

With FUSE, assuming the mount point of the sample DBMS is /foo/bar, users can access all images in the /foo/bar/image directory. At the same time, users can also store BLOBs in other relations, which appear in different directories. For example, documents can be stored in a document relation, and users can interact with those BLOBs as files in the /foo/bar/document directory.

 1 int FUSE_open(char *path) { // open() system call
 2   db->StartTransaction();
 3   return 0;
 4 }
 5 int FUSE_flush(char *path) { // close() system call
 6   db->CommitTransaction();
 7   return 0;
 8 }
 9 // pread() system call
10 int FUSE_read(char *path, u8 *buf, u64 size, i64 offset) {
11   // 1. Check whether file exists or not
12   auto &[relation, filename] = ExtractRelationAndFileName(path);
13   BlobState state = db->LookUp(relation, filename);
14   if (state == nullptr) { return -ENOENT; }
15   // 2. Path exists, read the BLOB
16   assert(state->size > offset);
17   size = std::min(size, state->size - offset);
18   db->ReadBlob(state, [&](std::span<u8> blob) {
19     std::memcpy(buf, blob.data() + offset, size);
20   });
21   return size;
22 }

Listing 1: FUSE integration: Pseudo code for read operation
Expose BLOBs as read-only files. Listing 1 shows how to implement the read operation in the FUSE integration. To ensure subsequent reads on the same BLOB are consistent, we wrap all reads inside a transaction. This is achieved by implementing the open and flush FUSE operations (which are triggered by the open() and close() system calls, respectively) to start and commit a transaction (lines 1 to 8). For the read operation, the DBMS first looks up the relation (e.g., table image) to check if the file exists (lines 12 to 14). If the file exists, we use the Blob State to load the BLOB content and copy it to the user buffer (lines 17 to 20). Other read-only operations such as getattr are implemented similarly to read, i.e., a point query obtains the Blob State needed to satisfy those operations, as sketched below.
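For example, getattr can be served from the Blob State alone, in the same pseudo-code style as Listing 1 (the particular stat fields set here are illustrative):

int FUSE_getattr(char *path, struct stat *st) { // stat()/fstat() system call
  auto &[relation, filename] = ExtractRelationAndFileName(path);
  BlobState state = db->LookUp(relation, filename);  // single point query
  if (state == nullptr) { return -ENOENT; }
  st->st_mode = S_IFREG | 0444;  // BLOBs are exposed as read-only regular files
  st->st_size = state->size;     // size comes directly from the Blob State
  return 0;
}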
F. Indexing

Problems of current approaches. Popular DBMSs either index only BLOB prefixes, which misses many records, or store full BLOBs in all tree nodes of secondary indexes including inner nodes, increasing the tree height significantly and reducing performance. One solution is to use TOAST storage [6] and index the BLOB IDs based on their content. However, this requires tight coordination between relation APIs (e.g., tree scan) and indexing, which is hard to implement correctly.

Blob State index. Instead, by implementing a comparator for Blob State, index structures can store the Blob States in sorted order according to their BLOB content. This mirrors the above approach proposed for TOAST, which uses BLOB IDs as the indexed key. The difference is that the Blob State index avoids working with other relations, and it also accesses BLOB data directly, thus being cheaper and less complicated. Note that the indexing structure is untouched, and DBMSs can use any data structure like B-Tree or ART [42].

SHA-256 and BLOB prefix. For point queries, comparing entire BLOBs on every comparison is inherently expensive. Instead, we suggest using SHA-256 for more efficient BLOB equality checks³. Analogously, for range queries, a complete BLOB comparison may be unnecessary. A cheaper option is to store the BLOB prefix inside the Blob State, allowing the comparator to skip BLOB dereferencing in some situations.

³ We acknowledge that SHA-256 may not be theoretically foolproof for equality checks. However, the fundamental reliance of Bitcoin on SHA-256 to resist collision attacks [43, 44] implies its practical suitability for critical applications, including DBMSs, in ensuring reliable uniqueness checks.

Incremental comparator for Blob State. Because the comparator will be used extensively during index operations, comparing the full BLOB content of two Blob States in the comparator is costly and possibly unnecessary. Instead, we propose to compare Blob States incrementally, assuming that the two Blob States do not contain a tail extent. For point queries, the comparator evaluates the SHA-256 values embedded in the two Blob States and returns the result. Range queries involve additional steps: after the equality check, we use the embedded prefix for a cheap range check. If the two BLOBs have the same prefix, then we compare all the extents of the two BLOBs incrementally. Finally, if those extents are identical, then one of the two BLOBs is the prefix of the other, so we compare the sizes of the two objects and return the result.
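A sketch of such an incremental comparator in C++ is shown below (LoadExtent() stands for reading one extent through the buffer manager, and tail extents are ignored as assumed above):

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <span>

// Placeholder: returns the in-memory content of extent `idx` of a BLOB.
std::span<const uint8_t> LoadExtent(const BlobState &s, uint32_t idx);

int CompareBlobStates(const BlobState &a, const BlobState &b) {
  // 1. Point-query fast path: equal sizes and equal SHA-256 digests mean the
  //    BLOBs are equal without dereferencing any content.
  if (a.blob_size == b.blob_size &&
      std::memcmp(a.sha256.data(), b.sha256.data(), 32) == 0) {
    return 0;
  }
  // 2. Cheap range check on the 32-byte prefix stored in the Blob State.
  if (int c = std::memcmp(a.prefix.data(), b.prefix.data(), 32); c != 0) {
    return c < 0 ? -1 : 1;
  }
  // 3. Same prefix: compare extent by extent, loading only as many extents as
  //    needed to find the first difference.
  uint64_t remaining = std::min(a.blob_size, b.blob_size);
  for (uint32_t i = 0; remaining > 0; i++) {
    std::span<const uint8_t> ea = LoadExtent(a, i);  // extent i of BLOB a
    std::span<const uint8_t> eb = LoadExtent(b, i);  // same tier, hence same size
    const size_t n = std::min<uint64_t>(remaining, ea.size());
    if (int c = std::memcmp(ea.data(), eb.data(), n); c != 0) {
      return c < 0 ? -1 : 1;
    }
    remaining -= n;
  }
  // 4. One BLOB is a prefix of the other: the shorter one orders first.
  return a.blob_size < b.blob_size ? -1 : (a.blob_size > b.blob_size ? 1 : 0);
}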
Semantic index for BLOB. There are situations where indexing the semantic meaning of BLOBs is more appropriate than indexing the raw binary data. One way to implement that is to support an index based on a function or scalar expression computed from the BLOB attribute of the relation, similar to the Expression Index in PostgreSQL [45]. With the Blob Tuple, the DBMS can dynamically compute the derived data for the Blob Tuple comparator during query execution. Below is one example of how users interact with the semantic index:

CREATE UDF classify(blob) -> TEXT;
CREATE INDEX foo ON image(classify(content));
SELECT * FROM image WHERE classify(content) = 'cat';

In this example, the Blob Tuples are sorted according to the classify() UDF. During the SELECT, the DBMS scans through all Blob Tuples classified as cat and returns those data records to the user.

G. Extent Synchronization And Eviction

Synchronization: Coarse-grained vs. fine-grained. To access/evict an extent from the buffer pool, we can either use fine-grained synchronization (one latch per page) or coarse-grained latching (synchronize on the first page of the extent). The former design is more complex and may introduce overheads. For example, when N threads attempt to read the same extent of N pages from storage, all workers contend for N latches, each wins one, and then calls pread() to fetch one page. Contrarily, with coarse-grained latching, only one worker will call pread(), allowing the remaining workers to work on other tasks. Therefore, we opted for coarse-grained latching for extent synchronization/eviction.

Fair extent eviction. In the coarse-grained extent synchronization design, the eviction probability of an extent may be similar to that of a normal page. However, we argue that an N-page extent should have an eviction probability N times higher than that of a single page. To do that, we adjust the eviction probability of every page and extent according to its size:

    if (rand(MAX_EXT_SIZE) <= extent_size[pid]) Evict();
H. Discussion

Tail extent vs. Extent tier formula. The tail extent completely resolves the storage utilization issue compared to the tier formula, but it slows down BLOB growth operations. That is, appending new data to a BLOB with a tail extent is more expensive than for a normal BLOB because of the extent clone operation, which includes one extent allocation and a memcpy() of the data from the tail extent to the new extent. Generally, the tail extent should be used if the workloads do not involve growth operations. We summarize the differences in the table below:

                            internal frag.   growth op.
    tail extent             minimal          slow
    extent tier formula     low              fast
Concurrency control for BLOBs. The primary focus of this work is orthogonal to BLOB concurrency control. However, let us mention one possible design: to use a Single-Version Concurrency Control protocol such as 2PL [46], OCC [47], or Silo [48] on the Blob State relation. For example, with 2PL, when transaction A wants to update a BLOB, it acquires an exclusive lock on the record that contains the required Blob State and then modifies the BLOB content. If transaction B now concurrently accesses the same Blob State, it finds out that the required tuple is locked by transaction A. Consequently, transaction B aborts or waits according to the conflict resolution scheme [49, 50, 51, 52, 53, 54].
IV. VIRTUAL-MEMORY ASSISTED OPERATIONS

As Section II discussed, popular DBMSs implement complex and inefficient mechanisms to store large objects, primarily due to their reliance on buffer management designs that only support fixed-size pages. Recent work on buffer management relies on virtual memory to implement a buffer manager that allows variable-sized pages. The proposed methods, vmcache and exmap [55], simplify the implementation and enhance the performance of BLOB operations compared to previous buffer pool designs. This section details how our design benefits from these new buffer management techniques.

A. Virtual-Memory Assisted Buffer Manager

Problems of fixed-size pages. With fixed-size pages, DBMSs arrange extents and pages as arbitrary disjointed buffer frames, i.e., most BLOBs are not represented as contiguous memory. As a result, either external libraries must work explicitly with BLOB chunks (e.g., for regex search, external libraries apply regex matching on every BLOB chunk), or the DBMS must allocate a big memory chunk and then memcpy() the BLOB content into this memory block before processing, consuming memory bandwidth extensively.

vmcache. A recent work, vmcache [55], exploits virtual memory to implement a simple yet effective buffer manager. This technique helps manage BLOBs for two reasons. First, vmcache presents an extent as contiguous memory and needs only one page translation per extent to retrieve the buffer frame(s). Contrarily, previous buffer pool designs (e.g., hash table or pointer swizzling [56, 57, 58]) trigger exactly N page translations for the same task. Second, assuming we can adjust the mapping of virtual to physical memory in user space at runtime, vmcache can present a list of disjointed memory blocks (i.e., extent sequences) as contiguous memory.

Virtual memory remapping. One potential method that allows virtual memory remapping is Rewired User-space Memory Access (RUMA) [59]. However, RUMA slows vmcache down significantly in out-of-memory workloads due to its memory management method⁴. Instead, based on exmap, which provides performant and scalable page table manipulation primitives [55], we propose virtual memory aliasing, a technique that copies the physical addresses of an extent sequence and maps them to a free virtual memory space, presenting disjointed extents as contiguous memory.

⁴ RUMA uses an in-memory file and SHARED mmap() to manage the OS page table. In this design, page eviction requires fallocate(PUNCH_HOLE) to free the physical memory of the memory file, which is very slow.

B. Virtual Memory Aliasing

Operations. We depict virtual memory aliasing in Figure 4. First, when a transaction reads a BLOB of multiple disjointed extents, it retrieves the Blob State and loads all the extents into the buffer manager. After that, the transaction requests a free contiguous range of virtual addresses (termed the aliasing area), and then calls the memory aliasing operation on this aliasing area. Consequently, exmap updates the page table to map the physical addresses of the aliasing area to those of all the extents. Finally, users access the aliasing area, which presents the BLOB content as a single contiguous memory block.

[Fig. 4. Virtual memory aliasing operation. An aliasing area is a contiguous range of virtual memory addresses: (1) read the Blob State, (2) reserve a free VM block (the aliasing area), (3) call virtual memory aliasing, (4) exmap clones the physical addresses of the extents into the page table, and (5) the user reads the aliasing area as contiguous memory.]
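The read path described above could be sketched as follows (pseudo code in the style of Listing 1; AliasArea, exmap_alias(), and the other helpers are hypothetical wrappers around the exmap primitives, not the actual exmap interface):

// Pseudo code: read a BLOB through virtual memory aliasing.
std::span<const u8> ReadBlobAliased(BlobState *state) {
  // 1. Make sure all extents of the BLOB reside in the buffer pool.
  std::vector<Extent> extents = bm->LoadExtents(state);
  // 2. Reserve a contiguous range of free virtual addresses (aliasing area).
  AliasArea area = alias_allocator->Reserve(state->size);
  // 3. Map the physical pages of the (disjointed) extents into that range,
  //    in order, by updating the page table; no data is copied.
  u64 offset = 0;
  for (const Extent &e : extents) {
    exmap_alias(area.base + offset, bm->PhysicalPages(e), e.page_count);
    offset += e.page_count * PAGE_SIZE;
  }
  // 4. The caller now sees the BLOB as one contiguous, read-only memory block.
  return {area.base, state->size};
}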
Aliasing area: Constraints. It is reasonable to bound the required number of virtual addresses. Because the largest object is limited by the maximum buffer pool size, the aliasing area does not need to be bigger than that. One may wonder whether to use N separate aliasing areas for N workers so that concurrent workers do not need to synchronize. However, this approach consumes an excessive number of virtual addresses, i.e., ten workers with a database size of 160GB require approximately 420M virtual addresses.

Aliasing area: Proposed design. The following figure illustrates the design of the aliasing area:

[Figure: Aliasing area design — every worker owns a worker-local aliasing area of worker_local_size; a shared aliasing area of the same size as the buffer pool is split into N logical blocks. (1) Small BLOBs are aliased in the worker-local area; (2) larger BLOBs reserve contiguous logical blocks in the shared area.]

Every worker has one exclusive worker-local aliasing area. In the first case, when BLOBs are smaller than the size of the worker-local area (worker_local_size), the worker uses its local area without contention with other workers. Otherwise (case 2), the worker requests free contiguous virtual addresses from a shared pool (the shared aliasing area). The shared pool is split into N logical blocks, with each block similar in size to the worker-local area. During reservation, the worker exclusively uses a range of contiguous logical blocks sufficiently large to alias the BLOB. The DBMS uses a range lock to synchronize concurrent accesses to the shared area.
Lightweight synchronization on shared area. An intriguing finding is that, with an appropriate worker-local size, we can limit the number of logical blocks to a small amount while also capping the number of virtual memory addresses. Assume the buffer pool is 160GB, i.e., the shared area is also 160GB. If the worker count is 10 and the size of a worker-local area is 1GB, then the number of logical blocks is 160 and the total size of the aliasing areas is 170GB, which is only 6.25% bigger than the buffer manager. And because the number of blocks is small, we can use a simple range lock using a bitmap and compare-and-swap, i.e., in the above example, the bitmap only has 160 bits, which corresponds to 3 uint64_t.
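A sketch of such a bitmap-based range lock in C++ (illustrative; for brevity, the reserved range is assumed not to straddle a 64-bit word):

#include <atomic>
#include <cstdint>

constexpr int NUM_BLOCKS = 160;               // logical blocks of the shared area
constexpr int WORDS = (NUM_BLOCKS + 63) / 64; // 3 uint64_t in the example above

class SharedAreaLock {
 public:
  // Try to exclusively reserve `count` consecutive logical blocks starting at
  // `first`, using one compare-and-swap on the covering 64-bit word.
  bool TryLockRange(int first, int count) {
    const int word = first / 64;
    const uint64_t mask =
        (count == 64 ? ~0ULL : ((1ULL << count) - 1)) << (first % 64);
    uint64_t cur = bits_[word].load(std::memory_order_relaxed);
    while ((cur & mask) == 0) {  // all requested blocks are currently free
      if (bits_[word].compare_exchange_weak(cur, cur | mask,
                                            std::memory_order_acquire)) {
        return true;
      }
    }
    return false;  // some block is taken; caller retries at another offset
  }
  void UnlockRange(int first, int count) {
    const int word = first / 64;
    const uint64_t mask =
        (count == 64 ? ~0ULL : ((1ULL << count) - 1)) << (first % 64);
    bits_[word].fetch_and(~mask, std::memory_order_release);
  }
 private:
  std::atomic<uint64_t> bits_[WORDS] = {};  // all blocks initially free
};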
Overhead: TLB shootdown. One issue is that the worker must invalidate the mapping between the virtual addresses of the aliasing area and the physical memory pages. This involves clearing the corresponding page table entries and invalidating the TLB cache (i.e., a TLB shootdown), which interrupts all CPU cores and clears the TLB of all CPUs. Although the overheads can be non-negligible [60, 55], we argue that memory aliasing substantially simplifies BLOB operations and is cheaper than the malloc() and memcpy() combination. We will explain this later in Section V-E.

Size of worker-local area. The worker-local area need not be big because the BLOB size constraint is small in practice [61, 62, 63]. With a proper configuration like 1 GB, the shared area will be rarely used. Even if the worker uses the shared aliasing area, i.e., the BLOB is bigger than the local area, the contention on the shared area is insignificant to other operations. Further elaboration on this will be provided in Section V-F.

V. EVALUATION

In this section, we empirically show that our approach achieves superior performance to file systems in BLOB management – although we disable fsync() for all competitor DBMSs and file systems – while still offering qualitative benefits like transactional semantics and durability.

A. Experiment Setup

Implementation. We integrate our proposed techniques, denoted as Our, into LeanStore [57], an open-source storage engine. In this version of LeanStore, we implement vmcache and exmap [55] as the buffer manager. The default size of the buffer pool is 32GB. Our implementation uses distributed per-thread write-ahead logging with page-level dependency tracking [28, 64], combined with group commit [65, 66].

Hardware & OS. We ran all experiments on a single-socket machine with an Intel Core i7-13700K (16 cores, 32 hardware threads), 64GB DRAM, and a Samsung SSD 980 Pro M.2 as the storage. For the OS, we use Linux 6.2 with exmap installed.

Competitors: DBMSs. We compare our implementation against three popular DBMSs: PostgreSQL [67], MySQL/InnoDB [68], and SQLite [69]. We do not evaluate DuckDB [70] because there exists a comparison between SQLite and DuckDB in BLOB workloads [2], and also because DuckDB was not designed for managing large objects.

DBMS config. For PostgreSQL and MySQL, we configure the benchmark to connect to the server using a Unix socket. We use a 32GB buffer pool for MySQL and SQLite, and a 16GB shared buffer for PostgreSQL as recommended [71]. Since this work focuses on BLOB buffer and storage management, which is orthogonal to transactional aspects, we run all DBMSs at the lowest transactional isolation level offered. To ensure fair comparisons with file systems, we disable both BLOB compression and fsync() for all competitor DBMSs.

Competitors: File systems. We also evaluate our design against four file systems: Ext4 [3], XFS [72], BtrFS [73], and F2FS [74]. For the Ext4 file system, we mount it with two options, data=journal and data=ordered, and we refer to them as Ext4.journal and Ext4.ordered, respectively. With Ext4.journal, data is also written to the journal. On the other hand, Ext4.ordered only writes file metadata to the journal and only does so after the data is flushed to the secondary storage.

File system config. We disable readahead because it is orthogonal to the core operation. As stated earlier, we do not use fsync() for any file system benchmark because it would become the dominant overhead in every file system benchmark if it were enabled. Moreover, our implementation uses group commit, so the critical path usually does not involve I/O.

B. Evaluation of BLOB Logging

Experiment information. We evaluate our logging scheme using synthetic YCSB workloads with different payload sizes. Specifically, we use five configurations: 120 bytes, 100KB, 10MB, one workload with a random size between 4KB and 10MB, and 1GB. The working dataset of all experiments fits in memory. We run the following experiments in single-threaded mode with a read ratio of 50%. We use a simple memcpy() as the BLOB read operator. BtrFS is not shown because its performance is almost identical to Ext4.ordered.

Baselines. We implemented two baselines: Our.ht and Our.physlog. Our.ht uses a traditional hash table buffer pool instead of vmcache+exmap, and thus does not benefit from virtual memory aliasing. Our.physlog employs all techniques except async BLOB logging. Instead, it appends every large object to the write-ahead log. To accommodate BLOBs larger than the WAL buffer, we split every BLOB into small segments and append these segments to the WAL buffer.

[Fig. 5. YCSB benchmark with normal payload size (120B): throughput (txn/s) of MySQL, PostgreSQL, SQLite, Ext4.journal, XFS, Ext4.ordered, F2FS, Our.ht, Our.physlog, and Our.]
120B payload. First, we evaluate all systems with normal YCSB and show the result in Figure 5. All file systems and SQLite provide higher throughput than PostgreSQL and MySQL because these systems operate only in main memory. In contrast, PostgreSQL and MySQL incur additional communication and (de)serialization overheads. Our DBMS provides at least 3.5× higher throughput compared to the other systems.

[Fig. 6. YCSB benchmark with BLOB payload: (a) 100KB payload, (b) 10MB payload, (c) mixed (4KB-10MB) payload, (d) 1GB payload, where the PostgreSQL and SQLite benchmark scripts fail. fsync() is turned off for all systems except Our, Our.ht, and Our.physlog.]

100KB payload. As Figure 6(a) shows, MySQL and PostgreSQL provide poor throughput, again because of the network and serialization overheads of these DBMSs. All file systems have comparable throughput (including BtrFS), except Ext4.journal. Ext4.journal exhibits bad performance because it includes I/O in the execution time while other file systems do not, and it also triggers journaling operations more excessively. One notable observation is that SQLite is faster than Ext4.journal, i.e., it does not trigger I/O during transaction execution. All file systems are slower than Our and Our.ht because of system call overheads. Our.physlog is 11% slower than Our because of the WAL operations.

10MB payload: All systems vs. Our. Figure 6(b) shows that PostgreSQL and MySQL still exhibit bad performance. For file systems, Ext4.journal remains the slowest due to the journaling. SQLite is slower than Ext4.journal in this experiment because it triggers WAL checkpointing aggressively (2.5 checkpoints per BLOB write [2]). The other file systems provide comparable throughput; all are at least 13% slower than Our because of one extra memory copy call. That is, file systems cause two memory copies, i.e., one from the pread() system call and another from the BLOB read operator in the application. Contrarily, only one memory copy is required in Our because it replaces pread() with the lightweight virtual memory aliasing.

10MB payload: Our.physlog vs. Our. Our.physlog is slower than Our, providing 30% less throughput, mainly because the hot path includes time waiting for the group committer to flush the BLOBs. That is, because the BLOB size is as big as the configured WAL buffer, transactions must spend considerable time waiting for the group commit to finish. By increasing the size of the WAL buffer (e.g., from 10 MB to 50 MB), this overhead becomes smaller, but the overall throughput is still lower than that of Our.

Mixed 4KB-10MB payload. Popular DBMSs exhibit poor performance in this experiment, as depicted in Figure 6(c). Ext4.journal is the worst amongst all file systems, trailing Ext4.ordered by 45%. Surprisingly, the performance differences between Our and the file systems are larger than in previous experiments because of the OS file size modification overhead, which includes ftruncate() to resize files and new buffer allocation in the page cache. On the other hand, Our and Our.ht handle this workload without imposing extra overhead. This also explains why Our.physlog is faster than the file systems in this experiment.

1GB payload. As illustrated in Figure 6(d), all enterprise DBMSs perform poorly. Specifically, the PostgreSQL client library returns a Statement parameter length overflow error, and SQLite gives a BLOB too big error, leading to benchmark failure. The two baselines and the file systems perform similarly, and they show at least 70% less throughput than Our. This experiment exhibits the benefits of our proposal compared to existing techniques.

Hash table buffer pool vs. vmcache+exmap. In the experiments shown in Figure 6, Our.ht surpasses all other systems except Our, showing the advantages of the proposed BLOB designs. Still, Our.ht is not as performant as Our because (1) vmcache+exmap is lightweight and more performant than a hash table [55] and (2) Our.ht does not benefit from virtual memory aliasing. We will demonstrate the differences further in Section V-E.
C. Evaluation of Metadata Operations

Description. In the experiment illustrated in Figure 7, we compare the efficiency of metadata operations between our approach and file systems. To do that, we either retrieve the Blob State of 10 consecutive BLOBs or call fstat() on ten consecutive files in all file systems. We do not evaluate the competitor DBMSs because they are not performant enough, as shown in the previous experiments. The BLOB payload size in this experiment is 100KB.

Result. As shown, all file systems have similar performance. Our DBMS provides 15.6× more throughput than all file systems. This is because our approach maintains all BLOB metadata in a B-Tree index that provides efficient lookup/scan queries, while metadata operations of file systems are very slow [1, 2], resulting in significant differences.

[Fig. 7. Metadata operations in our approach vs. file systems (BtrFS, XFS, Ext4, F2FS, Our); Our is 15.6× faster.]

D. Evaluation of Proposed Extent Data Structure

Description. This experiment evaluates the proposed physical storage format using real, read-only Wikipedia analytic datasets. First, we collect English Wikipedia analytic data, specifically the article sizes and their corresponding views, and build a database based on this distribution. The total size of all articles in this experiment is 23GB. During the initial phase, we insert random data according to the article sizes. In the benchmark phase, we pick a random article according to the article views and execute a memcpy() to simulate the article read. Similar to the previous experiment, we do not evaluate the widely-used DBMSs (i.e., PostgreSQL, MySQL, and SQLite). We run all benchmarks in two modes: when all data resides in memory (hot cache) and when all data is evicted. Because this is a read-only workload, the file system journal is unlikely to affect system performance, and thus we do not evaluate the Ext4 journal mode.

Hot cache experiment. In the experiment illustrated in Figure 8, we keep the page cache (or buffer manager) untouched after loading the initial data. This figure shows that our DBMS outperforms all file systems by at least 40%. There are two reasons. First, the overheads of fstat, open, and close in file systems are significant, while our approach does not suffer from that, as Section V-B shows. Second, pread() in file systems causes an extra memory copy from kernel space to user space, while we use virtual memory aliasing, which avoids one memory copy operation.

[Fig. 8. BLOB Read evaluation (hot cache): throughput of Ext4, XFS, BtrFS, F2FS, and Our.]

Cold cache experiment. Figure 9 depicts the performance of all systems when the page cache (or buffer pool) is empty. All file systems perform similarly, and Ext4 shows the highest performance among them. Our DBMS consistently outperforms all file systems, by at least 2.9× at the start of the benchmark. This is because our proposed storage format is simpler than that of file systems, as described earlier in Section III. Therefore, our DBMS is better at utilizing the NVMe SSD compared to file systems, i.e., the upper-bound read I/O of Ext4 is 59MB/s, while that of our DBMS is 174MB/s. And because the buffer cache fills up more quickly, our DBMS can serve more in-memory-only transactions, resulting in a 3.9× difference in throughput at the end of this experiment.

[Fig. 9. BLOB Read evaluation (cold cache): throughput over time (0-20s) for Ext4, BtrFS, XFS, F2FS, and Our; Our is 2.9× faster at the start and 3.9× faster at the end.]

E. Benefits of vmcache & exmap

Description. Our constantly outperforms Our.ht throughout the experiments in Figure 6, proving the effectiveness of vmcache+exmap for BLOB operations. This is mainly because vmcache+exmap is better at read operations than the traditional hash table buffer cache. In this experiment, we further analyze the benefits of vmcache+exmap using a read-only, in-memory YCSB workload. Similar to the logging experiment in Section V-B, we use a simple memcpy() as the BLOB read operator.
cantly, up to 2.1× when the worker count is 16 and 10MB objects. Specifically, we perform two operators: (1) allocate
BLOB. There are two reasons; the first is that the cost of a BLOB of random size between 1MB and 10MB, and (2)
the memcpy() becomes considerable. Second, malloc() delete a random BLOB. The allocation ratio is 80%, and the
creates an anonymous memory block to be filled later by the deletion ratio is 20%. Because allocation is 4× more frequent
memcpy(), causing page faults and allocation. than deletion, the storage capacity will increase with time
Key: memcpy() saturate memory hierarchy. Another obser- until the database/file system is full. We fix the database size
vation is that the Our.ht can not scale to 16 workers when (partition size in the case of file systems) to 32GB. For Ext4,
the BLOB size is either 1MB or 10MB. For the first case, i.e., the journal mode will reduce both the storage utilization and
when the BLOB size is 1MB and 16 workers, the combined the throughput, hence we do not evaluate Ext4.journal,
size of the client-side buffer and the internal DBMS memory i.e., the Ext4 in this experiment is Ext4.ordered.
block for the BLOB exceeds L3 cache capacity (30MB in our Result. As Figure 11 shows, almost all file systems except
machine), leading to contention at L3 cache. In the latter case, F2FS drop in throughput when the storage nearly reaches
the 16-workers variant not only contends for the L3 cache, but its limit. This is because those file systems use complicated
it also saturates the available memory bandwidth because of mechanisms to prevent fragmentation, which will not work
two memcpy() calls. well when the storage is almost full. Our extent recycling de-
sign, on the other hand, is lightweight and effective at reusing
F. Shared-Area Synchronization Overhead
deallocated extents, thus can maintain system performance in
Description. To evaluate the synchronization overhead, we run different storage utilization states. This also proves that our
a YCSB read-only workload with 10MB BLOBs. We run the design works reasonably well with mixed payload size, both
benchmark with 16 workers, and the maximum buffer size in terms of performance and storage utilization.
is 128GB. We use two worker-local sizes: 4MB and 16MB.
With the 4MB setting, the local area is smaller than a BLOB, H. BLOB Indexing
so transactions ask for free virtual addresses from the shared Description. To evaluate the Blob State index, we compare
aliasing area, which incurs contention overhead. For the 16MB it with the 1K prefix index which presents the approach used
setting, because the worker-local area is bigger than the BLOB, in MySQL and PostgreSQL. The indexed data is the English
no synchronization overhead occurs. Wikipedia [75], which contains many big articles. In this
TABLE II
OVERHEAD OF SHARED-AREA SYNCHRONIZATION

Use shared area (wrk-local size) | txn/s | instruct. | cycles | kernel cycles | cache misses
Yes (4MB)                        | 3,453 | 1,311k    | 14M    | 714k          | 14k
No (16MB)                        | 3,477 | 1,321k    | 14M    | 703k          | 14k
Result. Table II depicts that the two variants perform similarly. All statistics, such as the number of instructions, cycles, kernel cycles, and cache misses, are almost identical. As a result, the throughput of both variants is similar. Therefore, the extra overhead caused by synchronization on the shared area is trivial, while the shared area provides all necessary functionality and also limits the number of virtual addresses.
G. Extent Reusability

Description. In this experiment, we evaluate the free extent management design by constantly allocating and deleting objects. Specifically, we perform two operations: (1) allocate a BLOB of random size between 1MB and 10MB, and (2) delete a random BLOB. The allocation ratio is 80%, and the deletion ratio is 20%. Because allocation is 4× more frequent than deletion, the storage consumption increases over time until the database/file system is full. We fix the database size (the partition size in the case of file systems) to 32GB. For Ext4, the journal mode would reduce both storage utilization and throughput, hence we do not evaluate Ext4.journal, i.e., Ext4 in this experiment is Ext4.ordered.
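A compact, self-contained driver for this 80/20 allocate/delete mix could look as follows; allocate_blob() and delete_blob() are stubs standing in for the system under test, and the iteration count is an arbitrary placeholder.

#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Stubs standing in for the system under test (DBMS or file system).
static uint64_t allocate_blob(size_t size) { static uint64_t next = 0; (void)size; return next++; }
static void delete_blob(uint64_t id) { (void)id; }

int main() {
    std::mt19937_64 rng(42);
    std::uniform_real_distribution<double> op(0.0, 1.0);
    std::uniform_int_distribution<size_t> blob_size(1 << 20, 10 << 20);  // 1MB..10MB

    std::vector<uint64_t> live;
    // 80% allocations of random-sized BLOBs, 20% deletions of a random live BLOB.
    // Allocation outpaces deletion 4:1, so space usage grows toward the 32GB limit.
    for (int i = 0; i < 1'000'000; i++) {
        if (live.empty() || op(rng) < 0.8) {
            live.push_back(allocate_blob(blob_size(rng)));
        } else {
            std::uniform_int_distribution<size_t> pick(0, live.size() - 1);
            size_t victim = pick(rng);
            delete_blob(live[victim]);
            live[victim] = live.back();
            live.pop_back();
        }
    }
    std::printf("live BLOBs: %zu\n", live.size());
    return 0;
}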
[Fig. 11. Performance at different storage utilization (throughput in txn/s versus capacity utilization from 80% to 100% for Ext4, BtrFS, XFS, F2FS, and Our; an annotation marks an Ext4 I/O burst). All systems eventually stop at full storage capacity. System performance is stable before the storage utilization reaches 80%.]

Result. As Figure 11 shows, almost all file systems (F2FS being the exception) drop in throughput when the storage nearly reaches its limit. This is because those file systems use complicated mechanisms to prevent fragmentation, which do not work well when the storage is almost full. Our extent recycling design, on the other hand, is lightweight and effective at reusing deallocated extents, and thus can maintain system performance across different storage utilization states. This also proves that our design works reasonably well with mixed payload sizes, both in terms of performance and storage utilization.
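As a rough illustration of extent recycling, the sketch below keeps deallocated extents in size-segregated free lists and reuses them before growing the database file; this is a generic free-list structure shown only to convey the idea, not necessarily the exact data structure of our implementation (the surplus of an oversized recycled extent is ignored here for brevity).

#include <cstdint>
#include <map>
#include <vector>

struct Extent { uint64_t first_page; uint64_t page_count; };

class ExtentAllocator {
public:
    explicit ExtentAllocator(uint64_t total_pages)
        : next_free_page_(0), total_pages_(total_pages) {}

    // Allocate an extent of page_count pages; prefer a recycled extent.
    bool Allocate(uint64_t page_count, Extent& out) {
        // Reuse the smallest deallocated extent that is large enough.
        auto it = free_.lower_bound(page_count);
        if (it != free_.end()) {
            out = it->second.back();
            it->second.pop_back();
            if (it->second.empty()) free_.erase(it);
            return true;
        }
        // Otherwise grow into untouched space, if any is left.
        if (next_free_page_ + page_count > total_pages_) return false;  // storage full
        out = {next_free_page_, page_count};
        next_free_page_ += page_count;
        return true;
    }

    // Deleted BLOBs return their extents here for later reuse.
    void Free(const Extent& e) { free_[e.page_count].push_back(e); }

private:
    std::map<uint64_t, std::vector<Extent>> free_;  // page_count -> recycled extents
    uint64_t next_free_page_;
    uint64_t total_pages_;
};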
H. BLOB Indexing

Description. To evaluate the Blob State index, we compare it with the 1K prefix index, which represents the approach used in MySQL and PostgreSQL. The indexed data is the English Wikipedia [75], which contains many big articles. In this dataset, the MySQL indexing limit of 767 bytes [22] corresponds to the 43rd percentile of article sizes, and the PostgreSQL limit of 8191 bytes [8] to the 95th percentile.

TABLE III
STATISTICS OF TWO INDEXING VARIANTS

Variant    | miss (%) | build time (ms) | size (MB) | # leaf | throughput (lookup/s)
Blob State | 0        | 350             | 88        | 22k    | 443k
1K Prefix  | 17       | 1,323           | 737       | 187k   | 438k

Result. As Table III shows, the Blob State index can store all articles in the index, i.e., miss(%) = 0, while the prefix index cannot serve 17% of all queries. This is because many documents share the same prefix, and the prefix index can only store one of them. In contrast, the Blob State index can differentiate the articles using their full content and hence can index all articles. Furthermore, the Blob State index creates significantly fewer leaf nodes (22k compared to 187k), resulting in faster construction time and lower storage consumption (3.8× and 8.4×, respectively). Besides, because we implement prefix compression, which favors the prefix index [76], both indexes have the same tree height and thus provide similar lookup performance.
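The failure mode of the prefix index is easy to reproduce: whenever two long documents agree on their first 767 bytes (MySQL's InnoDB key prefix limit [22]), the truncated keys are identical and the index alone cannot tell the documents apart. The snippet below illustrates this with two made-up articles; the shared 800-byte preamble is an assumption for the example, not data from the Wikipedia corpus.

#include <cstdio>
#include <cstring>
#include <string>

int main() {
    // Two hypothetical articles that agree on a long boilerplate preamble and
    // only differ after the first kilobyte.
    std::string preamble(800, 'x');            // shared 800-byte prefix
    std::string a = preamble + "article about databases";
    std::string b = preamble + "article about file systems";

    // A 767-byte prefix key cannot tell them apart, so a prefix index must
    // fall back to fetching the full value to answer the query.
    bool same_prefix_key = std::memcmp(a.data(), b.data(), 767) == 0;
    std::printf("prefix keys equal: %s\n", same_prefix_key ? "yes" : "no");
    return 0;
}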
I. Real Write-Intensive Workload: Git Clone

Description. We contrast our approach with file systems using a simulated Git Clone benchmark. To do that, we collect the filesystem-level traces of the following git command:

git clone --depth 1 git@github.com:torvalds/linux.git

We implement the simulated workload according to the traces and run the workload in single-threaded mode. The size of the experimental dataset is 1.28GB.

TABLE IV
GIT-CLONE BENCHMARK

System       | time (ms) | instructions | kernel cycles
Our          | 906       | 65k          | 9k
Ext4.ordered | 1,834     | 256k         | 81k
Ext4.journal | 2,330     | 311k         | 108k
BtrFS        | 1,688     | 194k         | 66k
F2FS         | 2,112     | 236k         | 97k
XFS          | 1,464     | 188k         | 56k

Result. As illustrated in Table IV, file systems fall behind our DBMS significantly, largely because of the overheads of metadata operations: fstat, close, and especially open. Specifically, Ext4.ordered spends 36% of its execution time on open for file creation; the corresponding numbers for fstat and close are 4.8% and 1.6%, respectively. XFS performs best among the file systems because it spends only 36.6% of its execution time on system calls, the least of all file systems. Our approach, however, avoids this overhead by replacing all three system calls with efficient B-Tree operations, as demonstrated in Section V-C.
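The following sketch conveys why that substitution is cheap; std::map stands in for the DBMS B-Tree and BlobMeta is a hypothetical catalog entry, not our actual data structure. A single key lookup on the path yields the object's size and location, which covers what open() plus fstat() provide, and since no kernel file descriptor is created, there is nothing to close() afterwards.

#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical catalog entry describing where the object lives on storage.
struct BlobMeta {
    uint64_t size_bytes;
    std::vector<uint64_t> extents;  // on-storage extent ids
};

std::map<std::string, BlobMeta> catalog;  // path -> metadata, one indexed structure

// One ordered-index lookup replaces open() + fstat(); no descriptor to close().
const BlobMeta* Lookup(const std::string& path) {
    auto it = catalog.find(path);
    return it == catalog.end() ? nullptr : &it->second;
}

int main() {
    catalog["/repo/README"] = {1234, {42, 43}};
    if (const BlobMeta* m = Lookup("/repo/README"))
        std::printf("size=%llu bytes, %zu extents\n",
                    (unsigned long long)m->size_bytes, m->extents.size());
    return 0;
}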
VI. RELATED WORK

Ubiquitous BLOB storage: File systems. File systems have always been one of the core research areas of computer science, and they have adapted to manage objects of various sizes, including large objects. One benefit regarding large objects that file systems have over DBMSs is access simplicity and efficiency. That is, file systems support direct access to BLOB data, while DBMSs introduce additional overheads during BLOB operations, such as transactional processing, logging, and networking, among others.

Large object management in DBMSs. There is little work on BLOB management in DBMSs, which partly explains the prevalence of file systems. To our knowledge, there are only two academic works on this topic, both conducted before 2010, and these works primarily focused on the performance characteristics of DBMSs. In 2006, Sears et al. [19] argued that file systems are better than DBMSs for large objects, and provided some comparative BLOB benchmarks between SQL Server and the NTFS file system. Subsequently, in 2008, another study delivered experimental results of different BLOB workloads on several DBMSs, discussing the performance bottlenecks of these systems [77]. SQLite is the only DBMS optimized for BLOB operations, and its team even suggests that SQLite can replace file systems for such tasks [78, 25]. Still, as discussed throughout this paper, there are many opportunities for improvement in SQLite, an observation that aligns with findings from numerous previous studies [79, 13, 2]. Nevertheless, with our proposed techniques, we challenge the conventional wisdom and prove that DBMSs can provide superior performance to file systems for BLOB management.

Networks. As Section V-B shows, networking is one primary overhead of MySQL and PostgreSQL, partially explaining why SQLite significantly outperforms the two client-server DBMSs. Existing works on improving the DBMS network stack fall into two categories: (1) avoiding unnecessary computation in the network stack and (2) utilizing modern hardware. Notable techniques in the first category include a new data serialization method for large result sets [80] and pushing DBMS logic into kernel space to mitigate the networking overhead of a DBMS proxy [81]. In the second category, some studies focused on utilizing RDMA for remote data accesses [82, 83, 84, 85] or leveraging NVMe over Fabrics [86, 87, 88]. One particular work can be placed in both categories: Fent et al. [89] propose to replace conventional network protocols (e.g., TCP over Ethernet) with a novel communication library that provides unified APIs for the adaptive selection of RDMA and shared memory. These techniques can enhance BLOB access over the network, an area we will explore in upcoming research.

DBMS-backed file systems. There is limited work on file systems backed by DBMSs, with the Oracle Database Filesystem [35] as a notable exception. Aligning with our approach, Oracle DBFS utilizes the FUSE library to provide POSIX-standard file system interfaces to connect to the DBMS. It differs from our approach in that Oracle DBFS essentially acts as a translation layer from file APIs to the DBMS interface (i.e., PL/SQL procedure calls), while our solution provides direct data access identical to file systems.

Aging and fragmentation. All systems supporting variable-sized objects suffer from the aging problem, i.e., performance can decline over time in some workloads because of increasing fragmentation. For example, after the application allocates lots of small BLOBs and deletes most of them, the storage system may struggle to locate a suitable extent for a huge BLOB allocation. We note that system aging is an active research topic in the file system community [90, 91, 92, 93, 94]. Despite that, file system aging is still not a solved problem [92, 93], and some recent works only try to mitigate it [92, 94]. We believe that, in principle, an out-of-place write policy can solve the aging problem. The core idea is to decouple the logical PID from the on-storage physical address. Consequently, the DBMS can allocate every extent as new and map those PIDs to the available physical addresses on secondary storage. Because this is a significant topic, we plan to investigate it in the future.
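For completeness, the indirection idea behind out-of-place writes can be pictured with the following generic sketch; the class and its fields are hypothetical illustrations of the concept, not a description of our system or of any existing file system. Every write of a logical page targets a freshly allocated physical location, and a translation table keeps the logical identifier stable.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Logical page ids (PIDs) are decoupled from physical addresses through a
// translation table, so writes never have to fit into a previously used slot
// and fragmentation of the logical space becomes irrelevant.
class PageTranslation {
public:
    // Write a PID out of place: allocate the next physical slot and remap.
    uint64_t WriteOutOfPlace(uint64_t pid) {
        uint64_t phys = next_physical_++;
        auto it = table_.find(pid);
        if (it != table_.end()) stale_.push_back(it->second);  // old slot becomes garbage
        table_[pid] = phys;
        return phys;
    }

    // Translate a PID for a read; returns false if the page was never written.
    bool Translate(uint64_t pid, uint64_t& phys) const {
        auto it = table_.find(pid);
        if (it == table_.end()) return false;
        phys = it->second;
        return true;
    }

private:
    std::unordered_map<uint64_t, uint64_t> table_;  // PID -> physical address
    std::vector<uint64_t> stale_;                   // stale slots awaiting reuse/GC
    uint64_t next_physical_ = 0;
};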
VII. SUMMARY

In this paper, we demonstrate that DBMSs can be more efficient than file systems in handling BLOBs. To achieve this, we introduce a comprehensive design for allocating and logging large objects in DBMSs. Our performance study shows that our proposed approach successfully outperforms many popular file systems while ensuring transactional consistency and durability for large objects. Moreover, FUSE integration allows external apps to access BLOBs similarly to file systems, paving the way toward a unified storage system for objects of arbitrary size. Our implementation is open source and available at https://github.com/leanstore/leanstore/tree/blob.
REFERENCES

[1] W. Jannen, J. Yuan, Y. Zhan, A. Akshintala, J. Esmet, Y. Jiao, A. Mittal, P. Pandey, P. Reddy, L. Walsh, M. A. Bender, M. Farach-Colton, R. Johnson, B. C. Kuszmaul, and D. E. Porter, "Betrfs: A right-optimized write-optimized file system," in FAST. USENIX Association, 2015, pp. 301-315.
[2] K. P. Gaffney, M. Prammer, L. C. Brasfield, D. R. Hipp, D. R. Kennedy, and J. M. Patel, "Sqlite: Past, present, and future," Proc. VLDB Endow., vol. 15, no. 12, pp. 3535-3547, 2022.
[3] A. Mathur, M. Cao, S. Bhattacharya, A. Dilger, A. Tomas, and L. Vivier, "The new ext4 filesystem: current status and future plans," in Proceedings of the Linux Symposium, vol. 2. Citeseer, 2007, pp. 21-33.
[4] "Ext4 Howto," https://ext4.wiki.kernel.org/index.php/Ext4_Howto, 2019.
[5] "ext4(5) — Linux manual page," https://man7.org/linux/man-pages/man5/ext4.5.html, 2023.
[6] "PostgreSQL TOAST format," https://www.postgresql.org/docs/current/storage-toast.html, 2023.
[7] "PostgreSQL Release notes 9.3," https://www.postgresql.org/docs/9.3/release-9-3.html, 2023.
[8] "Postgresql index tuple size limit," https://github.com/postgres/postgres/blob/master/src/include/access/itup.h#L71, 2023.
[9] "PostgreSQL pg_largeobject," https://www.postgresql.org/docs/current/catalog-pg-largeobject.html, 2023.
[10] "Large object in PostgreSQL," https://pgpedia.info/l/large-object.html, 2023.
[11] "Sqlite file format," https://www.sqlite.org/fileformat2.html, 2004.
[12] "Limits In SQLite," https://www.sqlite.org/limits.html, 2023.
[13] "Sqlite clustered indexes and the without rowid optimization," https://www.sqlite.org/withoutrowid.html, 2023.
[14] "SQLite WAL mode," https://sqlite.org/wal.html, 2022.
[15] D. Korotkevitch, Pro SQL Server Internals. Apress, 2016.
[16] "SQL Server: binary and varbinary (Transact-SQL)," https://learn.microsoft.com/en-us/sql/t-sql/data-types/binary-and-varbinary-transact-sql, 2023.
[17] "Sqlserver error code 1919," http://www.sql-server-helper.com/error-messages/msg-1501-2000.aspx, 2023.
[18] "Sqlserver create index," https://learn.microsoft.com/en-us/sql/t-sql/statements/create-index-transact-sql, 2023.
[19] R. Sears, C. van Ingen, and J. Gray, "To BLOB or not to BLOB: large object storage in a database or a filesystem?" CoRR, vol. abs/cs/0701168, 2007.
[20] "Externally Stored Fields in MySQL/InnoDB," https://dev.mysql.com/doc/refman/8.0/en/innodb-row-format.html, 2023.
[21] "MySQL Data Type Storage Requirements," https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html, 2023.
[22] "Mysql index prefix limits," https://dev.mysql.com/doc/refman/8.0/en/column-indexes.html, 2023.
[23] "MySQL/InnoDB Redo Log for LOB," https://github.com/mysql/mysql-server/blob/8.0/storage/innobase/include/lob0zip.h#L78, 2023.
[24] "Large object in MySQL/InnoDB," https://www.percona.com/blog/how-innodb-handles-text-blob-columns/, 2023.
[25] "SQLite: 35% Faster Than The Filesystem," https://www.sqlite.org/fasterthanfs.html, 2022.
[26] H. Mühleisen, "Cidr keynote," https://www.youtube.com/watch?v=dv4A2LIFG80?t=1811, 2023.
[27] M. Cao, T. Y. Tso, B. Pulavarty, S. Bhattacharya, A. Dilger, and A. Tomas, "State of the art: Where we are with the ext3 filesystem," in Proceedings of the Ottawa Linux Symposium (OLS). Citeseer, 2005, pp. 69-96.
[28] M. Haubenschild, C. Sauer, T. Neumann, and V. Leis, "Rethinking logging, checkpoints, and recovery for high-performance storage engines," in SIGMOD Conference. ACM, 2020, pp. 877-892.
[29] J. Park, G. Oh, and S. Lee, "SQL statement logging for making sqlite truly lite," Proc. VLDB Endow., vol. 11, no. 4, pp. 513-525, 2017.
[30] G. Haas, M. Haubenschild, and V. Leis, "Exploiting directly-attached nvme arrays in DBMS," in CIDR. www.cidrdb.org, 2020.
[31] D. Kim, C. Park, S. Lee, and B. Nam, "Bolt: Barrier-optimized lsm-tree," in Middleware. ACM, 2020, pp. 119-133.
[32] M. Kang, S. Choi, G. Oh, and S. W. Lee, "2r: Efficiently isolating cold pages in flash storages," Proc. VLDB Endow., vol. 13, no. 11, pp. 2004-2017, 2020.
[33] "AWS S3 Actions," https://docs.aws.amazon.com/AmazonS3/latest/API/API_Operations.html, 2023.
[34] D. S. Hirschberg, "A class of dynamic memory allocation algorithms," Communications of the ACM, vol. 16, no. 10, pp. 615-618, 1973.
[35] K. Kunchithapadam, W. Zhang, A. Ganesh, and N. Mukherjee, "Oracle database filesystem," in SIGMOD Conference. ACM, 2011, pp. 1149-1160.
[36] "Tesseract ocr," https://github.com/tesseract-ocr/tesseract, 2023.
[37] R. Smith, "An overview of the tesseract ocr engine," in Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2. IEEE, 2007, pp. 629-633.
[38] G. Bradski, "The opencv library," Dr. Dobb's Journal: Software Tools for the Professional Programmer, vol. 25, no. 11, pp. 120-123, 2000.
[39] "io — Core tools for working with streams," https://docs.python.org/3.11/library/io.html, 2023.
[40] "Filesystem in Userspace," https://github.com/libfuse/libfuse, 2023.
[41] B. K. R. Vangoor, V. Tarasov, and E. Zadok, "To FUSE or not to FUSE: performance of user-space file systems," in FAST. USENIX Association, 2017, pp. 59-72.
[42] V. Leis, A. Kemper, and T. Neumann, "The adaptive radix tree: Artful indexing for main-memory databases," in ICDE. IEEE Computer Society, 2013, pp. 38-49.
[43] S. Nakamoto, "Bitcoin: A peer-to-peer electronic cash system," Decentralized Business Review, p. 21260, 2008.
[44] D. Yaga, P. Mell, N. Roby, and K. Scarfone, "Blockchain technology overview," CoRR, vol. abs/1906.11078, 2019.
[45] "PostgreSQL: Indexes on Expressions," https://www.postgresql.org/docs/current/indexes-expressional.html, 2024.
[46] P. A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
[47] H. T. Kung and J. T. Robinson, "On optimistic methods for concurrency control," ACM Trans. Database Syst., vol. 6, no. 2, pp. 213-226, 1981.
[48] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden, "Speedy transactions in multicore in-memory databases," in SOSP. ACM, 2013, pp. 18-32.
[49] H. Berenson, P. A. Bernstein, J. Gray, J. Melton, E. J. O'Neil, and P. E. O'Neil, "A critique of ANSI SQL isolation levels," in SIGMOD Conference. ACM Press, 1995, pp. 1-10.
[50] A. D. Fekete, E. J. O'Neil, and P. E. O'Neil, "A read-only transaction anomaly under snapshot isolation," SIGMOD Rec., vol. 33, no. 3, pp. 12-14, 2004.
[51] R. Ramakrishnan and J. Gehrke, Database Management Systems (3rd ed.). McGraw-Hill, 2003.
[52] Z. Guo, K. Wu, C. Yan, and X. Yu, "Releasing locks as early as you can: Reducing contention of hotspots by violating two-phase locking," in SIGMOD Conference. ACM, 2021, pp. 658-670.
[53] L.-D. Nguyen, S. W. Lee, and B. Nam, "In-page shadowing and two-version timestamp ordering for mobile dbmss," Proc. VLDB Endow., vol. 15, no. 11, pp. 2402-2414, 2022.
[54] C. Ye, W. Hwang, K. Chen, and X. Yu, "Polaris: Enabling transaction priority in optimistic concurrency control," Proc. ACM Manag. Data, vol. 1, no. 1, pp. 44:1-44:24, 2023.
[55] V. Leis, A. Alhomssi, T. Ziegler, Y. Loeck, and C. Dietrich, "Virtual-memory assisted buffer management," Proc. ACM Manag. Data, vol. 1, no. 1, pp. 7:1-7:25, 2023.
[56] G. Graefe, H. Volos, H. Kimura, H. A. Kuno, J. Tucek, M. Lillibridge, and A. C. Veitch, "In-memory performance for big data," PVLDB, vol. 8, no. 1, pp. 37-48, 2014.
[57] V. Leis, M. Haubenschild, A. Kemper, and T. Neumann, "Leanstore: In-memory data management beyond main memory," in ICDE. IEEE Computer Society, 2018, pp. 185-196.
[58] T. Neumann and M. J. Freitag, "Umbra: A disk-based system with in-memory performance," in CIDR. www.cidrdb.org, 2020.
[59] F. M. Schuhknecht, J. Dittrich, and A. Sharma, "RUMA has it: Rewired user-space memory access is possible!" Proc. VLDB Endow., vol. 9, no. 10, pp. 768-779, 2016.
[60] A. Crotty, V. Leis, and A. Pavlo, "Are you sure you want to use MMAP in your database management system?" in CIDR. www.cidrdb.org, 2022.
[61] "Pinterest help center: Review ad specs," https://help.pinterest.com/en/business/article/pinterest-product-specs, 2023.
[62] "LinkedIn: Media file types," https://www.linkedin.com/help/linkedin/answer/a564109, 2023.
[63] "How to post photos or GIFs on Twitter," https://help.twitter.com/en/using-twitter/tweeting-gifs-and-pictures, 2023.
[64] T. Wang and R. Johnson, "Scalable logging through emerging non-volatile memory," Proc. VLDB Endow., vol. 7, no. 10, pp. 865-876, 2014.
[65] A. Alhomssi, M. Haubenschild, and V. Leis, "The evolution of leanstore," in BTW, ser. LNI, vol. P-331. Gesellschaft für Informatik e.V., 2023, pp. 259-281.
[66] J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[67] "PostgreSQL source code," https://github.com/postgres/postgres/tree/REL_15_3, 2023.
[68] "MySQL source code," https://github.com/mysql/mysql-server/tree/mysql-cluster-8.0.33, 2023.
[69] "SQLite source code," https://github.com/sqlite/sqlite/tree/version-3.40.1, 2022.
[70] M. Raasveldt and H. Mühleisen, "Duckdb: an embeddable analytical database," in SIGMOD Conference. ACM, 2019, pp. 1981-1984.
[71] "PostgreSQL Server Configuration," https://www.postgresql.org/docs/15/runtime-config-resource.html, 2023.
[72] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck, "Scalability in the XFS file system," in USENIX Annual Technical Conference. USENIX Association, 1996, pp. 1-14.
[73] O. Rodeh, J. Bacik, and C. Mason, "BTRFS: the linux b-tree filesystem," ACM Trans. Storage, vol. 9, no. 3, p. 9, 2013.
[74] C. Lee, D. Sim, J. Y. Hwang, and S. Cho, "F2FS: A new file system for flash storage," in FAST. USENIX Association, 2015, pp. 273-286.
[75] "enwiki dump progress," https://dumps.wikimedia.org/enwiki/latest/, Jun. 2023.
[76] R. Bayer and K. Unterauer, "Prefix b-trees," ACM Trans. Database Syst., vol. 2, no. 1, pp. 11-26, 1977.
[77] S. Stancu-Mara and P. Baumann, "A comparative benchmark of large objects in relational databases," in IDEAS, ser. ACM International Conference Proceeding Series, vol. 299. ACM, 2008, pp. 277-284.
[78] "What If OpenDocument Used SQLite?" https://www.sqlite.org/affcase1.html, 2023.
[79] "Internal Versus External BLOBs in SQLite," https://www.sqlite.org/intern-v-extern-blob.html, 2011.
[80] M. Raasveldt and H. Mühleisen, "Don't hold my data hostage - A case for client protocol redesign," Proc. VLDB Endow., vol. 10, no. 10, pp. 1022-1033, 2017.
[81] M. Butrovich, K. Ramanathan, J. Rollinson, W. S. Lim, W. Zhang, J. Sherry, and A. Pavlo, "Tigger: A database proxy that bounces with user-bypass," Proc. VLDB Endow., vol. 16, no. 11, pp. 3335-3348, 2023.
[82] C. Binnig, A. Crotty, A. Galakatos, T. Kraska, and E. Zamanian, "The end of slow networks: It's time for a redesign," Proc. VLDB Endow., vol. 9, no. 7, pp. 528-539, 2016.
[83] F. Li, S. Das, M. Syamala, and V. R. Narasayya, "Accelerating relational databases by leveraging remote memory and RDMA," in SIGMOD Conference. ACM, 2016, pp. 355-370.
[84] T. Ziegler, J. Nelson-Slivon, V. Leis, and C. Binnig, "Design guidelines for correct, efficient, and scalable synchronization using one-sided RDMA," Proc. ACM Manag. Data, vol. 1, no. 2, pp. 131:1-131:26, 2023.
[85] T. Ziegler, V. Leis, and C. Binnig, "RDMA communication patterns," Datenbank-Spektrum, vol. 20, no. 3, pp. 199-210, 2020.
[86] H. Li, S. Jiang, C. Chen, A. Raina, X. Zhu, C. Luo, and A. Cidon, "Rubbledb: Cpu-efficient replication with nvme-of," in USENIX Annual Technical Conference. USENIX Association, 2023, pp. 689-703.
[87] T. A. Nguyen, H. Jeon, D. Han, D. Bae, Y. Yu, K. Kim, S. Park, J. Jeong, and B. Nam, "Nvme-driven lazy cache coherence for immutable data with nvme over fabrics," in CLOUD. IEEE, 2023, pp. 394-400.
[88] D. Han and B. Nam, "Improving access to HDFS using nvmeof," in CLUSTER. IEEE, 2019, pp. 1-2.
[89] P. Fent, A. van Renen, A. Kipf, V. Leis, T. Neumann, and A. Kemper, "Low-latency communication for fast DBMS using RDMA and shared memory," in ICDE. IEEE, 2020, pp. 1477-1488.
[90] A. Conway, E. Knorr, Y. Jiao, M. A. Bender, W. Jannen, R. Johnson, D. E. Porter, and M. Farach-Colton, "Filesystem aging: It's more usage than fullness," in HotStorage. USENIX Association, 2019.
[91] A. Conway, A. Bakshi, Y. Jiao, W. Jannen, Y. Zhan, J. Yuan, M. A. Bender, R. Johnson, B. C. Kuszmaul, D. E. Porter, and M. Farach-Colton, "File systems fated for senescence? nonsense, says science!" in FAST. USENIX Association, 2017, pp. 45-58.
[92] R. Kadekodi, S. Kadekodi, S. Ponnapalli, H. Shirwadkar, G. R. Ganger, A. Kolli, and V. Chidambaram, "Winefs: a hugepage-aware file system for persistent memory that ages gracefully," in SOSP. ACM, 2021, pp. 804-818.
[93] S. Kadekodi, V. Nagarajan, and G. R. Ganger, "Geriatrix: Aging what you see and what you don't see. A file system aging approach for modern storage systems," in USENIX Annual Technical Conference. USENIX Association, 2018, pp. 691-704.
[94] J. Park and Y. I. Eom, "Filesystem fragmentation on modern storage systems," ACM Transactions on Computer Systems, 2023.