Fully managed and integrated with Google Cloud, Azure, and AWS.
Build the fastest, most reliable GenAI apps with our advanced vector database.
Self-managed software with enterprise-grade compliance and reliability.
Synchronize data in near-real time to make data fast—without writing code.
In-memory database for caching & streaming.
Details about garbage collection
All of the above means that if there are a lot of updates and deletes, a large portion of our inverted index will become garbage, both slowing things down and consuming unnecessary memory.
You want to optimize the index, but you also don't want to disturb normal operation. This means that optimization or garbage collection should be a background process that's non-intrusive. It only needs to be faster than the deletion rate over a sufficient period of time so that you don't create more garbage than you can collect.
A single-term inverted index is an array of blocks, each of which contains an encoded list of records; e.g., a document id delta plus other data depending on the index encoding scheme. When some of these records refer to deleted documents this is called garbage.
The algorithm is simple:
Pseudo code:
foreach index_block as block:
reader = new_reader(block)
writer = new_write(block)
garbage = 0
while not reader.end():
record = reader.decode_next()
if record.is_valid():
if garbage != 0:
# Write the record at the writer's tip with a newly calculated delta
writer.write_record(record)
else:
writer.advance(record.length)
else:
garbage += record.length
Numeric indexes are a tree of inverted indexes with a special encoding of (docId delta, value). This means the same algorithm can be applied to them, only traversing each inverted index object in the tree.
Information about FORK GC can be found in this blog.
Since v1.6, the FORK GC is the default GC policy and was proven very efficient both in cleaning the index and not reducing query and indexing performance, even for very write-internsive use cases.