Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Garbage collection

Details about garbage collection

The need for garbage collection

All of the above means that if there are a lot of updates and deletes, a large portion of our inverted index will become garbage, both slowing things down and consuming unnecessary memory.

You want to optimize the index, but you also don't want to disturb normal operation. This means that optimization or garbage collection should be a background process that's non-intrusive. It only needs to be faster than the deletion rate over a sufficient period of time so that you don't create more garbage than you can collect.

Garbage collecting a single-term index

A single-term inverted index is an array of blocks, each of which contains an encoded list of records; e.g., a document id delta plus other data depending on the index encoding scheme. When some of these records refer to deleted documents this is called garbage.

The algorithm is simple:

  1. Create a reader and writer for each block.
  2. Read each block's records one by one.
  3. If no record is invalid, do nothing.
  4. When a garbage record is found, the reader is advanced, but not the writer.
  5. When at least one garbage record is found, the next records are encoded to the writer, recalculating the deltas.

Pseudo code:

foreach index_block as block:
   
   reader = new_reader(block)
   writer = new_write(block)
   garbage = 0
   while not reader.end():
        record = reader.decode_next()
        if record.is_valid():
            if garbage != 0:
                # Write the record at the writer's tip with a newly calculated delta
                writer.write_record(record)
            else:
                writer.advance(record.length)
        else:
            garbage += record.length

GC on numeric indexes

Numeric indexes are a tree of inverted indexes with a special encoding of (docId delta, value). This means the same algorithm can be applied to them, only traversing each inverted index object in the tree.

FORK GC

Information about FORK GC can be found in this blog.

Since v1.6, the FORK GC is the default GC policy and was proven very efficient both in cleaning the index and not reducing query and indexing performance, even for very write-internsive use cases.

RATE THIS PAGE
Back to top ↑