$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.5 2003/11/14 04:32:11 wieck Exp $

Notes about shared buffer access rules
--------------------------------------

concurrent VACUUM. The current implementation only supports a single
waiter for pin-count-1 on any particular shared buffer. This is enough
for VACUUM's use, since we don't allow multiple VACUUMs concurrently on a
single relation anyway.


Buffer replacement strategy interface:

The two files freelist.c and buf_table.c contain the buffer cache
replacement strategy. The interface to the strategy is:

    BufferDesc *
    StrategyBufferLookup(BufferTag *tagPtr, bool recheck)

        This is always the first call made by the buffer manager
        to check if a disk page is in memory. If so, the function
        returns the buffer descriptor and no further action is
        required.

        If the page is not in memory, StrategyBufferLookup()
        returns NULL.

        The recheck flag tells the strategy that this is a second
        lookup after flushing a dirty block. If the buffer manager
        has to evict another buffer, it releases the bufmgr lock
        while doing the write IO. During that time another backend
        could fault in the same page this backend is after, so we
        have to check again after the IO is done whether the page
        is in memory now (see the call-sequence sketch after this
        list).

    BufferDesc *
    StrategyGetBuffer(void)

        The buffer manager calls this function to get an unpinned
        cache buffer whose content can be evicted. The returned
        buffer might be empty, clean or dirty.

        The returned buffer is only a candidate for replacement.
        It is possible that while the buffer is being written,
        another backend finds and modifies it, so that it is dirty
        again. The buffer manager will then call StrategyGetBuffer()
        again to ask for another candidate.

    void
    StrategyReplaceBuffer(BufferDesc *buf, Relation rnode,
                          BlockNumber blockNum)

        Called by the buffer manager when it is about to change
        the association of a buffer with a disk page.

        Before this call, StrategyBufferLookup() still has to find
        the buffer even if it was returned by StrategyGetBuffer()
        as a candidate for replacement.

        After this call, the buffer must be returned for a lookup
        of the new page identified by rnode and blockNum.

    void
    StrategyInvalidateBuffer(BufferDesc *buf)

        Called from various places to inform the strategy that the
        content of this buffer has been thrown away. This happens
        for example when a relation is dropped.

        The buffer must be clean and unpinned on call.

        If the buffer is associated with a disk page,
        StrategyBufferLookup() must not return it for this page
        after the call.

    void
    StrategyHintVacuum(bool vacuum_active)

        Because vacuum reads all relations of the entire database
        through the buffer manager, it can greatly disturb the
        buffer replacement strategy. This function is used by vacuum
        to inform the strategy that all subsequent buffer lookups
        are caused by vacuum scanning relations.
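
A minimal sketch of how the buffer manager might drive these five
calls on a read request (illustrative only: the real code in bufmgr.c
also handles pinning, the buffer manager lock and IO errors, and the
lower-case helper names used here are made up):

    BufferDesc *
    read_buffer_sketch(Relation reln, BlockNumber blockNum)
    {
        BufferTag   tag;
        BufferDesc *buf;
        BufferDesc *found;

        /* build the lookup key for this relation/block */
        tag_set(&tag, reln, blockNum);          /* hypothetical helper */

        buf = StrategyBufferLookup(&tag, false);
        if (buf != NULL)
            return buf;                         /* cache hit, done */

        for (;;)
        {
            /* ask for an unpinned eviction candidate */
            buf = StrategyGetBuffer();

            if (buffer_is_dirty(buf))           /* hypothetical test */
            {
                /* the bufmgr lock is released during the write */
                flush_buffer(buf);              /* hypothetical write */

                /* another backend may have faulted the page in */
                found = StrategyBufferLookup(&tag, true);
                if (found != NULL)
                    return found;

                /* candidate may be dirty again; get another one */
                if (buffer_is_dirty(buf))
                    continue;
            }

            /* clean candidate: re-associate it with the new page */
            StrategyReplaceBuffer(buf, reln, blockNum);
            /* ... read the disk block into buf's data page ... */
            return buf;
        }
    }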


Buffer replacement strategy:

The buffer replacement strategy actually used in freelist.c is a
version of the Adaptive Replacement Cache (ARC) specially tailored
for PostgreSQL.

The algorithm works as follows:

    C is the size of the cache in number of pages (conf: shared_buffers).
    ARC uses 2*C Cache Directory Blocks (CDB). A cache directory block
    is always associated with one unique file page and "can" point to
    one shared buffer.

    All file pages known by the directory are managed in 4 LRU lists
    named B1, T1, T2 and B2. The T1 and T2 lists are the "real" cache
    entries, linking a file page to a memory buffer where the page is
    currently cached. Consequently T1len+T2len <= C. B1 and B2 are
    ghost cache directories that extend T1 and T2 so that the strategy
    remembers pages longer. The strategy tries to keep B1len+T1len and
    B2len+T2len both at C. T1len and T2len vary over the runtime
    depending on the lookup pattern and its resulting cache hits. The
    desired size of T1len is called T1target.
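
    The bookkeeping behind this could look roughly as follows (a
    sketch; the field names are illustrative and not necessarily
    those used in freelist.c):

        typedef struct
        {
            int         prev;       /* links within the CDB's list */
            int         next;
            int         list;       /* one of B1, T1, T2, B2 */
            BufferTag   buf_tag;    /* the file page this CDB tracks */
            int         buf_id;     /* shared buffer, or -1 for the
                                     * B1/B2 ghost entries */
        } BufferStrategyCDB;

        typedef struct
        {
            int         target_T1_size;     /* T1target */
            int         listHead[4];        /* LRU end of each list */
            int         listTail[4];        /* MRU end of each list */
            int         listSize[4];        /* B1len .. B2len */
            BufferStrategyCDB cdb[1];       /* 2*C entries follow */
        } BufferStrategyControl;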

    Assuming we have a full cache, one of 5 cases happens on a lookup:

    MISS    On a cache miss, depending on T1target and the actual
            T1len, the LRU buffer of T1 or T2 is evicted. Its CDB is
            removed from the T list and added as MRU of the
            corresponding B list. The now free buffer is replaced
            with the requested page and added as MRU of T1.

    T1 hit  The T1 CDB is moved to the MRU position of the T2 list.

    T2 hit  The T2 CDB is moved to the MRU position of the T2 list.

    B1 hit  This means that a buffer that was evicted from the T1
            list is now requested again, indicating that T1target is
            too small (otherwise it would still be in T1 and thus in
            memory). The strategy raises T1target, evicts a buffer
            depending on T1target and T1len and places the CDB at
            the MRU of T2.

    B2 hit  This is the opposite of B1: the T2 list is probably too
            small. So the strategy lowers T1target, evicts a buffer
            and places the CDB at the MRU of T2.

    Thus, every page that is found on lookup in any of the four lists
    ends up as the MRU of the T2 list. The T2 list therefore is the
    "frequency" cache, holding frequently requested pages.

    Every page that is seen for the first time ends up as the MRU of
    the T1 list. The T1 list is the "recency" cache, holding recent
    newcomers.
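
    In illustrative pseudo-C, the case analysis above might read as
    follows (list handling is simplified, and the adjustment of
    T1target follows the adaptation rule given in the ARC paper):

        cdb = strategy_lookup(tag);     /* hypothetical hash lookup */

        if (cdb == NULL)                /* MISS */
        {
            evict_T1_or_T2();           /* victim chosen by comparing
                                         * T1len against T1target */
            add_MRU(T1, make_cdb(tag));
        }
        else if (cdb->list == T1 || cdb->list == T2)
        {
            /* T1 or T2 hit: promote to the frequency cache */
            move_to_MRU(T2, cdb);
        }
        else if (cdb->list == B1)       /* B1 ghost hit */
        {
            /* T1 was too small: grow T1target */
            T1target = min(T1target + max(B2len / B1len, 1), C);
            evict_T1_or_T2();
            move_to_MRU(T2, cdb);       /* page is faulted back in */
        }
        else                            /* B2 ghost hit */
        {
            /* T2 was too small: shrink T1target */
            T1target = max(T1target - max(B1len / B2len, 1), 0);
            evict_T1_or_T2();
            move_to_MRU(T2, cdb);
        }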

    The tailoring done for PostgreSQL has to do with the way the
    query executor works. A typical UPDATE or DELETE first scans the
    relation, searching for the tuples, and then calls heap_update()
    or heap_delete(). This causes at least two lookups for the same
    block within one statement, and even more when multiple matching
    tuples are found in one block. As a result, every block touched
    in an UPDATE or DELETE would jump directly into the T2 cache,
    which is wrong. To prevent this, the strategy remembers which
    transaction added a buffer to the T1 list and will not promote
    it from there into the T2 cache during the same transaction.
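
    Sketched, assuming each CDB carries the id of the transaction
    that inserted it (the field name t_xid is illustrative):

        /* on a cache miss, when the page is added as MRU of T1: */
        cdb->t_xid = GetCurrentTransactionId();

        /* on a T1 hit: */
        if (TransactionIdEquals(cdb->t_xid, GetCurrentTransactionId()))
            move_to_MRU(T1, cdb);   /* same transaction: stay in T1 */
        else
            move_to_MRU(T2, cdb);   /* normal ARC promotion */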

    Another specialty is the change of the strategy during VACUUM.
    Lookups during VACUUM do not represent application needs, so it
    would be wrong to let them shift the cache balance T1target or
    to cause massive cache evictions. Therefore, a page read in to
    satisfy vacuum (not one that actually causes a hit on any list)
    is placed at the LRU position of the T1 list, for immediate
    reuse. Since vacuum usually requests many pages very fast, the
    natural side effect of this is that it will get back the very
    buffers it filled and possibly modified on the next call and
    will therefore do its work in a few shared memory buffers, while
    using whatever it finds in the cache already.
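
    Sketched, with vacuum_active standing for the flag set through
    StrategyHintVacuum() (variable name illustrative):

        /* on a cache miss: */
        if (vacuum_active)
            add_LRU(T1, cdb);   /* leave T1target alone; the buffer
                                 * becomes the next eviction candidate */
        else
            add_MRU(T1, cdb);   /* normal ARC behavior for newcomers */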