
Commit 9cd00c4

Checkpoint sorting and balancing.
Up to now checkpoints were written in the order they're in the BufferDescriptors. That's nearly random in a lot of cases, which performs badly on rotating media, but even on SSDs it causes slowdowns. To avoid that, sort checkpoints before writing them out. We currently sort by tablespace, relfilenode, fork and block number.

One of the major reasons that previously wasn't done, was fear of imbalance between tablespaces. To address that balance writes between tablespaces.

The other prime concern was that the relatively large allocation to sort the buffers in might fail, preventing checkpoints from happening. Thus pre-allocate the required memory in shared memory, at server startup.

This particularly makes it more efficient to have checkpoint flushing enabled, because that'll often result in a lot of writes that can be coalesced into one flush.

Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
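
The sort order named in the commit message (tablespace, relfilenode, fork, block number) can be illustrated with a small standalone sketch. This is not the code from bufmgr.c, which is not shown in this excerpt: the struct SortItem, its field names, and the function sort_checkpoint_items are hypothetical stand-ins, and the C library's qsort is used purely for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for CkptSortItem; the field names are assumptions. */
typedef struct SortItem
{
    uint32_t tablespace;   /* tablespace identifier */
    uint32_t relfilenode;  /* relation file identifier */
    int      fork;         /* fork number */
    uint32_t block;        /* block number within the fork */
    int      buf_id;       /* index of the buffer to be written */
} SortItem;

/* Return from the comparator as soon as one key differs. */
#define CMP_FIELD(a, b) \
    do { \
        if ((a) < (b)) return -1; \
        if ((a) > (b)) return 1; \
    } while (0)

/* Compare by tablespace, then relfilenode, then fork, then block number. */
static int
sort_item_cmp(const void *pa, const void *pb)
{
    const SortItem *a = (const SortItem *) pa;
    const SortItem *b = (const SortItem *) pb;

    CMP_FIELD(a->tablespace, b->tablespace);
    CMP_FIELD(a->relfilenode, b->relfilenode);
    CMP_FIELD(a->fork, b->fork);
    CMP_FIELD(a->block, b->block);
    return 0;
}

/* Sort the to-be-written items in place, in an array supplied by the caller. */
void
sort_checkpoint_items(SortItem *items, size_t n)
{
    assert(items != NULL || n == 0);
    if (n > 1)
        qsort(items, n, sizeof(SortItem), sort_item_cmp);
}
```

After sorting, items that belong to the same file sit next to each other in ascending block order, which is what lets neighbouring writes be issued sequentially and later coalesced. The caller passes in an existing array, in the spirit of the pre-allocated array the commit message describes.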
1 parent 428b1d6 commit 9cd00c4

5 files changed: +277 -44 lines changed


src/backend/storage/buffer/README

Lines changed: 0 additions & 5 deletions
@@ -267,11 +267,6 @@ only needs to take the lock long enough to read the variable value, not
 while scanning the buffers. (This is a very substantial improvement in
 the contention cost of the writer compared to PG 8.0.)

-During a checkpoint, the writer's strategy must be to write every dirty
-buffer (pinned or not!). We may as well make it start this scan from
-nextVictimBuffer, however, so that the first-to-be-written pages are the
-ones that backends might otherwise have to write for themselves soon.
-
 The background writer takes shared content lock on a buffer while writing it
 out (and anyone else who flushes buffer contents to disk must do so too).
 This ensures that the page image transferred to disk is reasonably consistent.

src/backend/storage/buffer/buf_init.c

Lines changed: 19 additions & 3 deletions
@@ -24,6 +24,7 @@ LWLockMinimallyPadded *BufferIOLWLockArray = NULL;
 LWLockTranche BufferIOLWLockTranche;
 LWLockTranche BufferContentLWLockTranche;
 WritebackContext BackendWritebackContext;
+CkptSortItem *CkptBufferIds;


 /*
@@ -70,7 +71,8 @@ InitBufferPool(void)
 {
     bool        foundBufs,
                 foundDescs,
-                foundIOLocks;
+                foundIOLocks,
+                foundBufCkpt;

     /* Align descriptors to a cacheline boundary. */
     BufferDescriptors = (BufferDescPadded *)
@@ -104,10 +106,21 @@ InitBufferPool(void)
     LWLockRegisterTranche(LWTRANCHE_BUFFER_CONTENT,
                           &BufferContentLWLockTranche);

-    if (foundDescs || foundBufs || foundIOLocks)
+    /*
+     * The array used to sort to-be-checkpointed buffer ids is located in
+     * shared memory, to avoid having to allocate significant amounts of
+     * memory at runtime. As that'd be in the middle of a checkpoint, or when
+     * the checkpointer is restarted, memory allocation failures would be
+     * painful.
+     */
+    CkptBufferIds = (CkptSortItem *)
+        ShmemInitStruct("Checkpoint BufferIds",
+                        NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+
+    if (foundDescs || foundBufs || foundIOLocks || foundBufCkpt)
     {
         /* should find all of these, or none of them */
-        Assert(foundDescs && foundBufs && foundIOLocks);
+        Assert(foundDescs && foundBufs && foundIOLocks && foundBufCkpt);
         /* note: this path is only taken in EXEC_BACKEND case */
     }
     else
@@ -190,5 +203,8 @@ BufferShmemSize(void)
     /* to allow aligning the above */
     size = add_size(size, PG_CACHE_LINE_SIZE);

+    /* size of checkpoint sort array in bufmgr.c */
+    size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+
     return size;
 }
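
The last hunk adds the sort array to the shared memory size estimate via add_size and mul_size. As a rough standalone analogy, the sketch below shows overflow-checked size arithmetic for the same calculation. The helpers checked_mul, checked_add and reserve_sort_array are hypothetical names invented for this illustration. They report failure through a return value, which is a choice of this sketch and not a description of how the PostgreSQL helpers behave.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: multiply two sizes, failing instead of wrapping. */
static bool
checked_mul(size_t a, size_t b, size_t *out)
{
    if (a != 0 && b > SIZE_MAX / a)
        return false;           /* a * b would overflow size_t */
    *out = a * b;
    return true;
}

/* Hypothetical helper: add two sizes, failing instead of wrapping. */
static bool
checked_add(size_t a, size_t b, size_t *out)
{
    if (b > SIZE_MAX - a)
        return false;           /* a + b would overflow size_t */
    *out = a + b;
    return true;
}

/*
 * Add room for n_buffers sort items of item_size bytes each to an existing
 * size estimate. On failure, *total is left unchanged.
 */
bool
reserve_sort_array(size_t base, size_t n_buffers, size_t item_size,
                   size_t *total)
{
    size_t array_bytes;

    assert(total != NULL);
    if (!checked_mul(n_buffers, item_size, &array_bytes))
        return false;
    return checked_add(base, array_bytes, total);
}
```

Computing the estimate with the same element count and element size that the allocation later uses keeps the reservation and the allocation in agreement, which is what lets the array be set up once at server startup.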
