$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.1 2001/07/06 21:04:25 tgl Exp $

Notes about shared buffer access rules
--------------------------------------

There are two separate access control mechanisms for shared disk buffers:
reference counts (a/k/a pin counts) and buffer locks. (Actually, there's
a third level of access control: one must hold the appropriate kind of
lock on a relation before one can legally access any page belonging to
the relation. Relation-level locks are not discussed here.)

Pins: one must "hold a pin on" a buffer (increment its reference count)
before being allowed to do anything at all with it. An unpinned buffer is
subject to being reclaimed and reused for a different page at any instant,
so touching it is unsafe. Typically a pin is acquired via ReadBuffer and
released via WriteBuffer (if one modified the page) or ReleaseBuffer (if not).
It is OK and indeed common for a single backend to pin a page more than
once concurrently; the buffer manager handles this efficiently. It is
considered OK to hold a pin for long intervals --- for example, sequential
scans hold a pin on the current page until done processing all the tuples
on the page, which could be quite a while if the scan is the outer scan of
a join. Similarly, btree index scans hold a pin on the current index page.
This is OK because normal operations never wait for a page's pin count to
drop to zero. (Anything that might need to do such a wait is instead
handled by waiting to obtain the relation-level lock, which is why you'd
better hold one first.) Pins may not be held across transaction
boundaries, however.
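
For concreteness, a minimal sketch of the typical pin lifecycle (assuming
the caller already holds a suitable relation-level lock; "reln", "blkno",
and "dirtied" are placeholders, and error handling is omitted):

    /* Acquire a pin, use the page, then release the pin. */
    Buffer      buf;

    buf = ReadBuffer(reln, blkno);      /* pins the buffer */

    /* ... examine or modify the page, per the locking rules below ... */

    if (dirtied)
        WriteBuffer(buf);               /* marks buffer dirty, releases pin */
    else
        ReleaseBuffer(buf);             /* just releases the pin */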

Buffer locks: there are two kinds of buffer locks, shared and exclusive,
which act just as you'd expect: multiple backends can hold shared locks on
the same buffer, but an exclusive lock prevents anyone else from holding
either shared or exclusive lock. (These can alternatively be called READ
and WRITE locks.) These locks are short-term: they should not be held for
long. They are implemented as per-buffer spinlocks, so another backend
trying to acquire a competing lock will spin as long as you hold yours!
Buffer locks are acquired and released by LockBuffer(). It will *not* work
for a single backend to try to acquire multiple locks on the same buffer.
One must pin a buffer before trying to lock it.
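
Continuing the sketch above, the pin-then-lock discipline looks like this
(BUFFER_LOCK_SHARE, BUFFER_LOCK_EXCLUSIVE, and BUFFER_LOCK_UNLOCK are the
modes LockBuffer accepts):

    buf = ReadBuffer(reln, blkno);          /* pin first ... */
    LockBuffer(buf, BUFFER_LOCK_SHARE);     /* ... then lock */

    /* ... short-term read access to the page contents ... */

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* release lock promptly */
    ReleaseBuffer(buf);                     /* pin may be held longer */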

Buffer access rules:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive lock. To examine the commit status (XIDs and status bits) of
a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

2. Once one has determined that a tuple is interesting (visible to the
current transaction) one may drop the buffer lock, yet continue to access
the tuple's data for as long as one holds the buffer pin. This is what is
typically done by heap scans, since the tuple returned by heap_fetch
contains a pointer to tuple data in the shared buffer. Therefore the
tuple cannot go away while the pin is held (see rule #5). Its state could
change, but that is assumed not to matter after the initial determination
of visibility is made. (A sketch of this lock-coupling follows rule #5.)

3. To add a tuple or change the xmin/xmax fields of an existing tuple,
one must hold a pin and an exclusive lock on the containing buffer.
This ensures that no one else might see a partially-updated state of the
tuple.

4. It is considered OK to update tuple commit status bits (ie, OR the
values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
pin on a buffer. This is OK because another backend looking at the tuple
at about the same time would OR the same bits into the field, so there
is little or no risk of conflicting update; what's more, if there did
manage to be a conflict it would merely mean that one bit-update would
be lost and need to be done again later. These four bits are only hints
(they cache the results of transaction status lookups in pg_log), so no
great harm is done if they get reset to zero by conflicting updates.
(The sketch after rule #5 shows this too.)

5. To physically remove a tuple or compact free space on a page, one
must hold a pin and an exclusive lock, *and* observe while holding the
exclusive lock that the buffer's shared reference count is one (ie,
no other backend holds a pin). If these conditions are met then no other
backend can perform a page scan until the exclusive lock is dropped, and
no other backend can be holding a reference to an existing tuple that it
might expect to examine again. Note that another backend might pin the
buffer (increment the refcount) while one is performing the cleanup, but
it won't be able to actually examine the page until it acquires shared
or exclusive lock.
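
To make rules #2 and #4 concrete, a hedged sketch of the usual coupling
between lock, pin, and tuple access. The visibility test, the hint-bit
condition, and the processing step are placeholders, not real routines;
SetBufferCommitInfoNeedsSave() is the bufmgr call for marking a buffer
dirty after a hint-bit-only update:

    LockBuffer(buf, BUFFER_LOCK_SHARE);

    /* Rules #1/#2: visibility must be determined while locked */
    valid = tuple_is_visible(tuple);    /* placeholder visibility test */

    /* Rule #4: hint bits may be set under shared lock */
    if (xmin_known_committed)           /* placeholder condition */
    {
        tuple->t_data->t_infomask |= HEAP_XMIN_COMMITTED;
        SetBufferCommitInfoNeedsSave(buf);
    }

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    if (valid)
    {
        /*
         * Rule #2: with the lock dropped but the pin still held, the
         * tuple data cannot move or be removed (rule #5 guarantees it).
         */
        process_tuple(tuple);           /* placeholder */
    }

    ReleaseBuffer(buf);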


As of 7.1, the only operation that removes tuples or compacts free space is
(oldstyle) VACUUM. It does not have to implement rule #5 directly, because
it instead acquires exclusive lock at the relation level, which ensures
indirectly that no one else is accessing pages of the relation at all.

To implement concurrent VACUUM we will need to make it obey rule #5 fully.
To do this, we'll create a new buffer manager operation
LockBufferForCleanup() that gets an exclusive lock and then checks to see
if the shared pin count is currently 1. If not, it releases the exclusive
lock (but not the caller's pin) and waits until signaled by another backend,
whereupon it tries again. The signal will occur when UnpinBuffer
decrements the shared pin count to 1. As indicated above, this operation
might have to wait a good while before it acquires lock, but that shouldn't
matter much for concurrent VACUUM. The current implementation only
supports a single waiter for pin-count-1 on any particular shared buffer.
This is enough for VACUUM's use, since we don't allow multiple VACUUMs
concurrently on a single relation anyway.
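
In outline, the operation could look like the following sketch; the
refcount test and the wait primitive are stand-ins for whatever bufmgr
internals the real implementation uses:

    void
    LockBufferForCleanup(Buffer buffer)
    {
        /* Caller must already hold a pin on the buffer. */
        for (;;)
        {
            LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

            if (shared_refcount_is_one(buffer))     /* placeholder test */
                return;     /* sole pin + exclusive lock: cleanup is safe */

            /* Someone else holds a pin: drop lock (keep pin) and wait */
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
            wait_for_pin_count_one(buffer);     /* signaled by UnpinBuffer */
        }
    }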