Commit 12915a5

Enhance checkpointer restartpoint statistics

This commit introduces enhancements to the pg_stat_checkpointer view by adding three new columns: restartpoints_timed, restartpoints_req, and restartpoints_done. These additions aim to improve the visibility and monitoring of restartpoint processes on replicas.

Previously, it was challenging to differentiate between successful and failed restartpoint requests. This limitation arises because restartpoints on replicas are dependent on checkpoint records from the primary, and cannot occur more frequently than these checkpoints.

The new columns allow for clear distinction and tracking of restartpoint requests, their triggers, and successful completions. This enhancement aids database administrators and developers in better understanding and diagnosing issues related to restartpoint behavior, particularly in scenarios where restartpoint requests may fail.

System catalog is changed. Catversion is bumped.

Discussion: https://postgr.es/m/99b2ccd1-a77a-962a-0837-191cdf56c2b9%40inbox.ru
Author: Anton A. Melnikov
Reviewed-by: Kyotaro Horiguchi, Alexander Korotkov

1 parent 64e77b4 commit 12915a5

10 files changed, +134 -9 lines changed


doc/src/sgml/monitoring.sgml

+27
@@ -2982,6 +2982,33 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 </para></entry>
 </row>

+<row>
+<entry role="catalog_table_entry"><para role="column_definition">
+<structfield>restartpoints_timed</structfield> <type>bigint</type>
+</para>
+<para>
+Number of scheduled restartpoints due to timeout or after a failed attempt to perform it
+</para></entry>
+</row>
+
+<row>
+<entry role="catalog_table_entry"><para role="column_definition">
+<structfield>restartpoints_req</structfield> <type>bigint</type>
+</para>
+<para>
+Number of requested restartpoints
+</para></entry>
+</row>
+
+<row>
+<entry role="catalog_table_entry"><para role="column_definition">
+<structfield>restartpoints_done</structfield> <type>bigint</type>
+</para>
+<para>
+Number of restartpoints that have been performed
+</para></entry>
+</row>
+
 <row>
 <entry role="catalog_table_entry"><para role="column_definition">
 <structfield>write_time</structfield> <type>double precision</type>

doc/src/sgml/wal.sgml

+33 -6
@@ -655,14 +655,41 @@
 directory.
 Restartpoints can't be performed more frequently than checkpoints on the
 primary because restartpoints can only be performed at checkpoint records.
-A restartpoint is triggered when a checkpoint record is reached if at
-least <varname>checkpoint_timeout</varname> seconds have passed since the last
-restartpoint, or if WAL size is about to exceed
-<varname>max_wal_size</varname>. However, because of limitations on when a
-restartpoint can be performed, <varname>max_wal_size</varname> is often exceeded
-during recovery, by up to one checkpoint cycle's worth of WAL.
+A restartpoint can be demanded by a schedule or by an external request.
+The <structfield>restartpoints_timed</structfield> counter in the
+<link linkend="monitoring-pg-stat-checkpointer-view"><structname>pg_stat_checkpointer</structname></link>
+view counts the first ones while the <structfield>restartpoints_req</structfield>
+the second.
+A restartpoint is triggered by schedule when a checkpoint record is reached
+if at least <xref linkend="guc-checkpoint-timeout"/> seconds have passed since
+the last performed restartpoint or when the previous attempt to perform
+the restartpoint has failed. In the last case, the next restartpoint
+will be scheduled in 15 seconds.
+A restartpoint is triggered by request due to similar reasons like checkpoint
+but mostly if WAL size is about to exceed <xref linkend="guc-max-wal-size"/>
+However, because of limitations on when a restartpoint can be performed,
+<varname>max_wal_size</varname> is often exceeded during recovery,
+by up to one checkpoint cycle's worth of WAL.
 (<varname>max_wal_size</varname> is never a hard limit anyway, so you should
 always leave plenty of headroom to avoid running out of disk space.)
+The <structfield>restartpoints_done</structfield> counter in the
+<link linkend="monitoring-pg-stat-checkpointer-view"><structname>pg_stat_checkpointer</structname></link>
+view counts the restartpoints that have really been performed.
+</para>
+
+<para>
+In some cases, when the WAL size on the primary increases quickly,
+for instance during massive INSERT,
+the <structfield>restartpoints_req</structfield> counter on the standby
+may demonstrate a peak growth.
+This occurs because requests to create a new restartpoint due to increased
+XLOG consumption cannot be performed because the safe checkpoint record
+since the last restartpoint has not yet been replayed on the standby.
+This behavior is normal and does not lead to an increase in system resource
+consumption.
+Only the <structfield>restartpoints_done</structfield>
+counter among the restartpoint-related ones indicates that noticeable system
+resources have been spent.
 </para>

 <para>
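
The three counters described in the wal.sgml text above can be compared to estimate how many restartpoint attempts did not complete. The following is a minimal C sketch and is not part of the commit. The struct and function names (DemoRestartpointCounters, demo_restartpoints_not_performed) are hypothetical. It assumes the three values were read from pg_stat_checkpointer on a standby in a single snapshot, and that every attempt is counted in exactly one of restartpoints_timed or restartpoints_req.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the three new view columns (not a PostgreSQL type). */
typedef struct DemoRestartpointCounters
{
    int64_t restartpoints_timed;    /* attempts triggered by schedule */
    int64_t restartpoints_req;      /* attempts triggered by request */
    int64_t restartpoints_done;     /* attempts that were actually performed */
} DemoRestartpointCounters;

/*
 * Estimate how many restartpoint attempts were not performed.
 * Assumption: attempts = timed + requested. The result is clamped at
 * zero in case the inputs are inconsistent with that assumption.
 */
int64_t
demo_restartpoints_not_performed(const DemoRestartpointCounters *c)
{
    int64_t attempts;

    assert(c != NULL);
    attempts = c->restartpoints_timed + c->restartpoints_req;
    if (attempts < c->restartpoints_done)
        return 0;
    return attempts - c->restartpoints_done;
}
```

A large result next to a slowly growing restartpoints_done would match the burst of unperformed requests that the documentation text describes as normal during heavy WAL generation on the primary.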

src/backend/catalog/system_views.sql

+3
@@ -1141,6 +1141,9 @@ CREATE VIEW pg_stat_checkpointer AS
 SELECT
 pg_stat_get_checkpointer_num_timed() AS num_timed,
 pg_stat_get_checkpointer_num_requested() AS num_requested,
+pg_stat_get_checkpointer_restartpoints_timed() AS restartpoints_timed,
+pg_stat_get_checkpointer_restartpoints_requested() AS restartpoints_req,
+pg_stat_get_checkpointer_restartpoints_performed() AS restartpoints_done,
 pg_stat_get_checkpointer_write_time() AS write_time,
 pg_stat_get_checkpointer_sync_time() AS sync_time,
 pg_stat_get_checkpointer_buffers_written() AS buffers_written,

src/backend/postmaster/checkpointer.c

+25 -2
@@ -340,6 +340,8 @@ CheckpointerMain(void)
 pg_time_t now;
 int elapsed_secs;
 int cur_timeout;
+bool chkpt_or_rstpt_requested = false;
+bool chkpt_or_rstpt_timed = false;

 /* Clear any already-pending wakeups */
 ResetLatch(MyLatch);
@@ -358,7 +360,7 @@ CheckpointerMain(void)
 if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
 {
 do_checkpoint = true;
-PendingCheckpointerStats.num_requested++;
+chkpt_or_rstpt_requested = true;
 }

 /*
@@ -372,7 +374,7 @@ CheckpointerMain(void)
 if (elapsed_secs >= CheckPointTimeout)
 {
 if (!do_checkpoint)
-PendingCheckpointerStats.num_timed++;
+chkpt_or_rstpt_timed = true;
 do_checkpoint = true;
 flags |= CHECKPOINT_CAUSE_TIME;
 }
@@ -408,6 +410,24 @@ CheckpointerMain(void)
 if (flags & CHECKPOINT_END_OF_RECOVERY)
 do_restartpoint = false;

+if (chkpt_or_rstpt_timed)
+{
+chkpt_or_rstpt_timed = false;
+if (do_restartpoint)
+PendingCheckpointerStats.restartpoints_timed++;
+else
+PendingCheckpointerStats.num_timed++;
+}
+
+if (chkpt_or_rstpt_requested)
+{
+chkpt_or_rstpt_requested = false;
+if (do_restartpoint)
+PendingCheckpointerStats.restartpoints_requested++;
+else
+PendingCheckpointerStats.num_requested++;
+}
+
 /*
 * We will warn if (a) too soon since last checkpoint (whatever
 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
@@ -471,6 +491,9 @@ CheckpointerMain(void)
 * checkpoints happen at a predictable spacing.
 */
 last_checkpoint_time = now;
+
+if (do_restartpoint)
+PendingCheckpointerStats.restartpoints_performed++;
 }
 else
 {
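
The checkpointer.c change defers the choice of counter until do_restartpoint is known: the trigger is first recorded in a local flag, then attributed to either the checkpoint or the restartpoint counter. The sketch below isolates that pattern in standalone C. It is an illustration rather than the committed code. DemoCheckpointerStats, demo_count_trigger, and demo_count_result are hypothetical names, and the sketch assumes that the restartpoints_performed increment sits on the branch taken when the operation succeeded, as the surrounding context lines suggest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PendingCheckpointerStats (five counters only). */
typedef struct DemoCheckpointerStats
{
    int64_t num_timed;
    int64_t num_requested;
    int64_t restartpoints_timed;
    int64_t restartpoints_requested;
    int64_t restartpoints_performed;
} DemoCheckpointerStats;

/*
 * Attribute a trigger to the checkpoint or the restartpoint counters.
 * "timed" and "requested" play the role of chkpt_or_rstpt_timed and
 * chkpt_or_rstpt_requested; do_restartpoint selects the counter family.
 */
void
demo_count_trigger(DemoCheckpointerStats *stats,
                   bool timed, bool requested, bool do_restartpoint)
{
    assert(stats != NULL);

    if (timed)
    {
        if (do_restartpoint)
            stats->restartpoints_timed++;
        else
            stats->num_timed++;
    }

    if (requested)
    {
        if (do_restartpoint)
            stats->restartpoints_requested++;
        else
            stats->num_requested++;
    }
}

/* Count a restartpoint as performed only when it actually completed. */
void
demo_count_result(DemoCheckpointerStats *stats,
                  bool do_restartpoint, bool performed)
{
    assert(stats != NULL);

    if (performed && do_restartpoint)
        stats->restartpoints_performed++;
}
```

Keeping the trigger in a flag first is what lets the same loop iteration feed num_timed and num_requested on a primary and the restartpoint counters on a standby.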

src/backend/utils/activity/pgstat_checkpointer.c

+6
@@ -49,6 +49,9 @@ pgstat_report_checkpointer(void)
 #define CHECKPOINTER_ACC(fld) stats_shmem->stats.fld += PendingCheckpointerStats.fld
 CHECKPOINTER_ACC(num_timed);
 CHECKPOINTER_ACC(num_requested);
+CHECKPOINTER_ACC(restartpoints_timed);
+CHECKPOINTER_ACC(restartpoints_requested);
+CHECKPOINTER_ACC(restartpoints_performed);
 CHECKPOINTER_ACC(write_time);
 CHECKPOINTER_ACC(sync_time);
 CHECKPOINTER_ACC(buffers_written);
@@ -116,6 +119,9 @@ pgstat_checkpointer_snapshot_cb(void)
 #define CHECKPOINTER_COMP(fld) pgStatLocal.snapshot.checkpointer.fld -= reset.fld;
 CHECKPOINTER_COMP(num_timed);
 CHECKPOINTER_COMP(num_requested);
+CHECKPOINTER_COMP(restartpoints_timed);
+CHECKPOINTER_COMP(restartpoints_requested);
+CHECKPOINTER_COMP(restartpoints_performed);
 CHECKPOINTER_COMP(write_time);
 CHECKPOINTER_COMP(sync_time);
 CHECKPOINTER_COMP(buffers_written);
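
The two macros in this file follow a common per-field pattern: CHECKPOINTER_ACC adds pending counts into the shared totals, and CHECKPOINTER_COMP subtracts the values saved at the last statistics reset when building a snapshot. The standalone sketch below reproduces that pattern for the three new fields only. All names (DemoCounters, demo_flush_pending, demo_snapshot, DEMO_ACC, DEMO_COMP) are hypothetical, and zeroing the pending struct after the flush is an assumption about code that lies outside the hunk shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of PgStat_CheckpointerStats: the three new fields. */
typedef struct DemoCounters
{
    int64_t restartpoints_timed;
    int64_t restartpoints_requested;
    int64_t restartpoints_performed;
} DemoCounters;

/* Fold the pending counts into the shared totals, then clear the pending set. */
void
demo_flush_pending(DemoCounters *shared, DemoCounters *pending)
{
    assert(shared != NULL && pending != NULL);

#define DEMO_ACC(fld) shared->fld += pending->fld
    DEMO_ACC(restartpoints_timed);
    DEMO_ACC(restartpoints_requested);
    DEMO_ACC(restartpoints_performed);
#undef DEMO_ACC

    /* Assumption: pending counts start from zero after each flush. */
    memset(pending, 0, sizeof(*pending));
}

/* Build the reader-visible values: shared totals minus the reset offsets. */
DemoCounters
demo_snapshot(const DemoCounters *shared, const DemoCounters *reset)
{
    DemoCounters snap;

    assert(shared != NULL && reset != NULL);
    snap = *shared;

#define DEMO_COMP(fld) snap.fld -= reset->fld
    DEMO_COMP(restartpoints_timed);
    DEMO_COMP(restartpoints_requested);
    DEMO_COMP(restartpoints_performed);
#undef DEMO_COMP

    return snap;
}
```

Because both macros are applied field by field, a new counter has to be listed in both places, which is why the commit touches the accumulate and the snapshot code together.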

src/backend/utils/adt/pgstatfuncs.c

+18
@@ -1193,6 +1193,24 @@ pg_stat_get_checkpointer_num_requested(PG_FUNCTION_ARGS)
 PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->num_requested);
 }

+Datum
+pg_stat_get_checkpointer_restartpoints_timed(PG_FUNCTION_ARGS)
+{
+PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->restartpoints_timed);
+}
+
+Datum
+pg_stat_get_checkpointer_restartpoints_requested(PG_FUNCTION_ARGS)
+{
+PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->restartpoints_requested);
+}
+
+Datum
+pg_stat_get_checkpointer_restartpoints_performed(PG_FUNCTION_ARGS)
+{
+PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->restartpoints_performed);
+}
+
 Datum
 pg_stat_get_checkpointer_buffers_written(PG_FUNCTION_ARGS)
 {

src/include/catalog/catversion.h

+1 -1
@@ -57,6 +57,6 @@
 */

 /* yyyymmddN */
-#define CATALOG_VERSION_NO 202312211
+#define CATALOG_VERSION_NO 202312251

 #endif

src/include/catalog/pg_proc.dat

+15
@@ -5721,6 +5721,21 @@
 proname => 'pg_stat_get_checkpointer_num_requested', provolatile => 's',
 proparallel => 'r', prorettype => 'int8', proargtypes => '',
 prosrc => 'pg_stat_get_checkpointer_num_requested' },
+{ oid => '8743',
+descr => 'statistics: number of timed restartpoints started by the checkpointer',
+proname => 'pg_stat_get_checkpointer_restartpoints_timed', provolatile => 's',
+proparallel => 'r', prorettype => 'int8', proargtypes => '',
+prosrc => 'pg_stat_get_checkpointer_restartpoints_timed' },
+{ oid => '8744',
+descr => 'statistics: number of backend requested restartpoints started by the checkpointer',
+proname => 'pg_stat_get_checkpointer_restartpoints_requested', provolatile => 's',
+proparallel => 'r', prorettype => 'int8', proargtypes => '',
+prosrc => 'pg_stat_get_checkpointer_restartpoints_requested' },
+{ oid => '8745',
+descr => 'statistics: number of backend performed restartpoints',
+proname => 'pg_stat_get_checkpointer_restartpoints_performed', provolatile => 's',
+proparallel => 'r', prorettype => 'int8', proargtypes => '',
+prosrc => 'pg_stat_get_checkpointer_restartpoints_performed' },
 { oid => '2771',
 descr => 'statistics: number of buffers written by the checkpointer',
 proname => 'pg_stat_get_checkpointer_buffers_written', provolatile => 's',

src/include/pgstat.h

+3
@@ -262,6 +262,9 @@ typedef struct PgStat_CheckpointerStats
 {
 PgStat_Counter num_timed;
 PgStat_Counter num_requested;
+PgStat_Counter restartpoints_timed;
+PgStat_Counter restartpoints_requested;
+PgStat_Counter restartpoints_performed;
 PgStat_Counter write_time; /* times in milliseconds */
 PgStat_Counter sync_time;
 PgStat_Counter buffers_written;

src/test/regress/expected/rules.out

+3
@@ -1822,6 +1822,9 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_buf_written_clean() AS buffers_cle
 pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 pg_stat_checkpointer| SELECT pg_stat_get_checkpointer_num_timed() AS num_timed,
 pg_stat_get_checkpointer_num_requested() AS num_requested,
+pg_stat_get_checkpointer_restartpoints_timed() AS restartpoints_timed,
+pg_stat_get_checkpointer_restartpoints_requested() AS restartpoints_req,
+pg_stat_get_checkpointer_restartpoints_performed() AS restartpoints_done,
 pg_stat_get_checkpointer_write_time() AS write_time,
 pg_stat_get_checkpointer_sync_time() AS sync_time,
 pg_stat_get_checkpointer_buffers_written() AS buffers_written,
