| 1 | +From pgsql-hackers-owner+M908@postgresql.org Sun Nov 19 14:27:43 2000 |
| 2 | +Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) |
| 3 | + by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA10885 |
| 4 | + for <pgman@candle.pha.pa.us>; Sun, 19 Nov 2000 14:27:42 -0500 (EST) |
| 5 | +Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) |
| 6 | + by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eAJJSMs83653; |
| 7 | + Sun, 19 Nov 2000 14:28:22 -0500 (EST) |
| 8 | + (envelope-from pgsql-hackers-owner+M908@postgresql.org) |
| 9 | +Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46] (may be forged)) |
| 10 | + by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eAJJQns83565 |
| 11 | + for <pgsql-hackers@postgreSQL.org>; Sun, 19 Nov 2000 14:26:49 -0500 (EST) |
| 12 | + (envelope-from pgman@candle.pha.pa.us) |
| 13 | +Received: (from pgman@localhost) |
| 14 | + by candle.pha.pa.us (8.9.0/8.9.0) id OAA06790; |
| 15 | + Sun, 19 Nov 2000 14:23:06 -0500 (EST) |
| 16 | +From: Bruce Momjian <pgman@candle.pha.pa.us> |
| 17 | +Message-Id: <200011191923.OAA06790@candle.pha.pa.us> |
| 18 | +Subject: Re: [HACKERS] WAL fsync scheduling |
| 19 | +In-Reply-To: <002101c0525e$2d964480$b97a30d0@sectorbase.com> "from Vadim Mikheev |
| 20 | + at Nov 19, 2000 11:23:19 am" |
| 21 | +To: Vadim Mikheev <vmikheev@sectorbase.com> |
| 22 | +Date: Sun, 19 Nov 2000 14:23:06 -0500 (EST) |
| 23 | +CC: Tom Samplonius <tom@sdf.com>, Alfred@candle.pha.pa.us, |
| 24 | + Perlstein <bright@wintelcom.net>, Larry@candle.pha.pa.us, |
| 25 | + Rosenman <ler@lerctr.org>, |
| 26 | + PostgreSQL-development <pgsql-hackers@postgresql.org> |
| 27 | +X-Mailer: ELM [version 2.4ME+ PL77 (25)] |
| 28 | +MIME-Version: 1.0 |
| 29 | +Content-Transfer-Encoding: 7bit |
| 30 | +Content-Type: text/plain; charset=US-ASCII |
| 31 | +Precedence: bulk |
| 32 | +Sender: pgsql-hackers-owner@postgresql.org |
| 33 | +Status: OR |
| 34 | + |
| 35 | +[ Charset ISO-8859-1 unsupported, converting... ] |
| 36 | +> > There are two parts to transaction commit. The first is writing all |
| 37 | +> > dirty buffers or log changes to the kernel, and second is fsync of the |
| 38 | +> ^^^^^^^^^^^^ |
| 39 | +> Backend doesn't write any dirty buffer to the kernel at commit time. |
| 40 | + |
| 41 | +Yes, I suspected that. |
| 42 | + |
| 43 | +> |
| 44 | +> > log file. |
| 45 | +> |
| 46 | +> The first part is writing commit record into WAL buffers in shmem. |
| 47 | +> This is what XLogInsert does. After that XLogFlush is called to ensure |
| 48 | +> that entire commit record is on disk. XLogFlush does *both* write() and |
| 49 | +> fsync() (single slock is used for both writing and fsyncing) if it needs to |
| 50 | +> do it at all. |
| 51 | + |
| 52 | +Yes, I realize there are new steps in WAL. |
| 53 | + |
| 54 | +> |
| 55 | +> > I suggest having a per-backend shared memory byte that has the following |
| 56 | +> > values: |
| 57 | +> > |
| 58 | +> > START_LOG_WRITE |
| 59 | +> > WAIT_ON_FSYNC |
| 60 | +> > NOT_IN_COMMIT |
| 61 | +> > backend_number_doing_fsync |
| 62 | +> > |
| 63 | +> > I suggest that when each backend starts a commit, it sets its byte to |
| 64 | +> > START_LOG_WRITE. |
| 65 | +> ^^^^^^^^^^^^^^^^^^^^^^^ |
| 66 | +> Isn't START_COMMIT more meaningful? |
| 67 | + |
| 68 | +Yes. |
| 69 | + |
| 70 | +> |
| 71 | +> > When it gets ready to fsync, it checks all backends. |
| 72 | +> ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 73 | +> What do you mean by this? The moment just after XLogInsert? |
| 74 | + |
| 75 | +Just before it calls fsync(). |
| 76 | + |
| 77 | +> |
| 78 | +> > If all are NOT_IN_COMMIT, it does fsync and continues. |
| 79 | +> |
| 80 | +> 1st edition: |
| 81 | +> > If one or more are in START_LOG_WRITE, it waits until no one is in |
| 82 | +> > START_LOG_WRITE. It then checks all WAIT_ON_FSYNC, and if it is the |
| 83 | +> > lowest backend in WAIT_ON_FSYNC, marks all others with its backend |
| 84 | +> > number, and does fsync. It then clears all backends with its number to |
| 85 | +> > NOT_IN_COMMIT. Other backends will see they are not the lowest |
| 86 | +> > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT |
| 87 | +> > so they can then continue, knowing their data was synced. |
| 88 | +> |
| 89 | +> 2nd edition: |
| 90 | +> > I have another idea. If a backend gets to the point that it needs |
| 91 | +> > fsync, and there is another backend in START_LOG_WRITE, it can go to an |
| 92 | +> > interruptible sleep, knowing another backend will perform the fsync and |
| 93 | +> > wake it up. Therefore, there is no busy-wait or timed sleep. |
| 94 | +> > |
| 95 | +> > Of course, a backend must set its status to WAIT_ON_FSYNC to avoid a |
| 96 | +> > race condition. |
| 97 | +> |
| 98 | +> The 2nd edition is much better. But I'm not sure we really need |
| 99 | +> these per-backend bytes in shmem. Why not just have some counters? |
| 100 | +> We can use a semaphore to wake up all waiters at once. |
| 101 | + |
| 102 | +Yes, that is much better and clearer. My idea was just to say, "if no |
| 103 | +one is entering commit phase, do the commit. If someone else is coming, |
| 104 | +sleep and wait for them to do the fsync and wake me up with a signal." |
| 105 | + |
| 106 | +> |
| 107 | +> > This allows a single backend not to sleep, and allows multiple backends |
| 108 | +> > to bunch up only when they are all about to commit. |
| 109 | +> > |
| 110 | +> > The reason backend numbers are written is so other backends entering the |
| 111 | +> > commit code will not interfere with the backends performing fsync. |
| 112 | +> |
| 113 | +> A woken-up backend can check what's written/fsynced by calling XLogFlush. |
| 114 | + |
| 115 | +Seems that may not be needed anymore with a counter. The only issue is |
| 116 | +that other backends may enter commit while fsync() is happening. The |
| 117 | +process that did the fsync must be sure to wake up only the backends |
| 118 | +that were waiting for it, and not other backends that may also be |
| 119 | +doing fsync as a group while the first fsync was happening. I leave |
| 120 | +those details to people more experienced. :-) |
| 121 | + |
| 122 | +I am just glad people liked my idea. |
| 123 | + |
| 124 | +-- |
| 125 | + Bruce Momjian | http://candle.pha.pa.us |
| 126 | + pgman@candle.pha.pa.us | (610) 853-3000 |
| 127 | + + If your life is a hard drive, | 830 Blythe Avenue |
| 128 | + + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 |
| 129 | + |
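The counter-based variant discussed in the thread (no per-backend bytes, one wake-up for all waiters) can be sketched as a toy model. This is only an illustration under stated assumptions, not PostgreSQL code: the class `GroupCommit`, its methods `start_commit` and `flush`, and the `fsync` callable are hypothetical names, a condition variable stands in for the semaphore, and a simple log-position counter stands in for the check a woken backend would make via XLogFlush. It also assumes every backend that calls `start_commit` eventually calls `flush`.

```python
import threading


class GroupCommit:
    """Toy model of counter-based fsync grouping (hypothetical, not PostgreSQL code)."""

    def __init__(self, fsync):
        self._cond = threading.Condition()
        self._fsync = fsync      # callable standing in for write() + fsync()
        self._starting = 0       # backends that started commit but have not reached flush()
        self._flushing = False   # True while one backend is inside fsync
        self._written = 0        # highest log position handed out so far
        self._flushed = 0        # highest log position known to be on disk

    def start_commit(self):
        """Announce that this backend is entering commit (the START_COMMIT state)."""
        with self._cond:
            self._starting += 1

    def flush(self):
        """Block until this backend's record is durable; True if we did the fsync."""
        with self._cond:
            self._starting -= 1
            self._written += 1
            my_pos = self._written
            while True:
                if self._flushed >= my_pos:
                    return False          # someone else's fsync covered our record
                if not self._flushing and self._starting == 0:
                    break                 # nobody else is coming: we do the fsync
                self._cond.wait()         # sleep until a flusher wakes us
            self._flushing = True
            target = self._written        # everything written before the fsync starts

        ok = False
        try:
            self._fsync()                 # run outside the lock so others can queue up
            ok = True
        finally:
            with self._cond:
                self._flushing = False
                if ok:
                    self._flushed = target
                self._cond.notify_all()   # wake all waiters; each rechecks its position
        return True
```

Backends that enter commit while an fsync is in progress get a log position above `target`, so the wake-up does not release them early: they recheck, see their record is not yet covered, and either wait again or perform the next fsync themselves. A lone backend never sleeps, since `_starting` is zero when it reaches `flush`.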