Add to mmap emails.

bmomjian · bmomjian · commit 2e6887df6356 · 2003-03-07T17:43:26.000Z
diff --git a/doc/TODO.detail/mmap b/doc/TODO.detail/mmap
@@ -1763,3 +1763,254 @@ message can get through to the mailing list cleanly
 
 
 
+From pgsql-committers-owner+M9273=maillist=candle.pha.pa.us@postgresql.org Thu Mar  6 19:37:25 2003
+Return-path: <pgsql-committers-owner+M9273=maillist=candle.pha.pa.us@postgresql.org>
+Received: from relay2.pgsql.com (relay2.pgsql.com [64.49.215.143])
+	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id h270bM624923
+	for <maillist@candle.pha.pa.us>; Thu, 6 Mar 2003 19:37:24 -0500 (EST)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+	by relay2.pgsql.com (Postfix) with ESMTP id 4D5CDEE0411
+	for <maillist@candle.pha.pa.us>; Thu,  6 Mar 2003 19:37:23 -0500 (EST)
+X-Original-To: pgsql-committers@postgresql.org
+Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251])
+	by postgresql.org (Postfix) with ESMTP
+	id 3120E47646F; Thu,  6 Mar 2003 19:36:58 -0500 (EST)
+Received: by perrin.int.nxad.com (Postfix, from userid 1001)
+	id 9CBE42105B; Thu,  6 Mar 2003 16:36:40 -0800 (PST)
+Date: Thu, 6 Mar 2003 16:36:40 -0800
+From: Sean Chittenden <sean@chittenden.org>
+To: Tom Lane <tgl@sss.pgh.pa.us>
+cc: Christopher Kings-Lynne <chriskl@familyhealth.com.au>,
+   pgsql-committers@postgresql.org, pgsql-performance@postgresql.org
+Subject: Re: [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ...
+Message-ID: <20030307003640.GF79234@perrin.int.nxad.com>
+References: <20030306031656.1876F4762E0@postgresql.org> <032f01c2e390$b1842b20$6500a8c0@fhp.internal> <11077.1046921667@sss.pgh.pa.us> <033f01c2e392$71476570$6500a8c0@fhp.internal> <12228.1046922471@sss.pgh.pa.us> <20030306094117.GA79234@perrin.int.nxad.com> <15071.1046964336@sss.pgh.pa.us>
+MIME-Version: 1.0
+Content-Type: multipart/signed; micalg=pgp-sha1;
+	protocol="application/pgp-signature"; boundary="HjNkcEWJ4DMx36DP"
+Content-Disposition: inline
+In-Reply-To: <15071.1046964336@sss.pgh.pa.us>
+User-Agent: Mutt/1.4i
+X-PGP-Key: finger seanc@FreeBSD.org
+X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0  83A6 DD99 E31F BC84 B341
+X-Web-Homepage: http://sean.chittenden.org/
+Precedence: bulk
+Sender: pgsql-committers-owner@postgresql.org
+Status: OR
+
+--HjNkcEWJ4DMx36DP
+Content-Type: text/plain; charset=us-ascii
+Content-Disposition: inline
+Content-Transfer-Encoding: quoted-printable
+
+[moving to -performance, please drop -committers from replies]
+
+> > I've toyed with the idea of adding this because it is monstrously more
+> > efficient than select()/poll() in basically every way, shape, and
+> > form.
+>
+> From what I've looked at, kqueue only wins when you are watching a
+> large number of file descriptors at the same time; which is an
+> operation done nowhere in Postgres.  I think the above would be a
+> complete waste of effort.
+
+It scales very well to many thousands of descriptors, but it also
+works well on small numbers.  kqueue is about 5x faster than
+select() or poll() on the low end of number of fd's.  As I said
+earlier, I don't think there is _much_ to gain in this regard, but I
+do think that it would be a speed improvement but only to one OS
+supported by PostgreSQL.  I think that there are bigger speed
+improvements to be had elsewhere in the code.
+
+> > Is this one of the areas of PostgreSQL that just needs to get
+> > slowly migrated to use mmap() or are there any gaping reasons why
+> > to not use the family of system calls?
+>
+> There has been much speculation on this, and no proof that it
+> actually buys us anything to justify the portability hit.
+
+Actually, I think that it wouldn't be that big of a portability hit
+because you still would read() and write() as always, but in
+performance sensitive areas, an #ifdef HAVE_MMAP section would have
+the appropriate mmap() calls.  If the system doesn't have mmap(),
+there isn't much to lose and we're in the same position we're in now.
+
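One way the #ifdef HAVE_MMAP idea above could look, as a minimal sketch and not code from the PostgreSQL tree. read_block() is a hypothetical name, and HAVE_MMAP is assumed to come from configure (it is defined by hand here only so the file builds alone). The sketch copies out of the mapping, so it illustrates the portability structure, not the performance gain:

```c
#define _DEFAULT_SOURCE         /* expose pread() and mmap() on glibc */
#include <fcntl.h>              /* open() flags, for callers */
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: normally set by configure; defined here for a standalone build. */
#ifndef HAVE_MMAP
#define HAVE_MMAP 1
#endif

#ifdef HAVE_MMAP
#include <sys/mman.h>
#endif

/*
 * read_block (hypothetical name): copy len bytes starting at offset from fd
 * into buf.  Returns 0 on success, -1 on failure or short read.
 */
int
read_block(int fd, off_t offset, void *buf, size_t len)
{
    ssize_t     n;

#ifdef HAVE_MMAP
    struct stat st;

    /*
     * Touching mapped pages past end-of-file raises SIGBUS, so only map
     * ranges that the file fully covers.
     */
    if (len > 0 && offset >= 0 && fstat(fd, &st) == 0 &&
        offset + (off_t) len <= st.st_size)
    {
        void       *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, offset);

        if (p != MAP_FAILED)
        {
            memcpy(buf, p, len);
            munmap(p, len);
            return 0;
        }
        /* mmap() failed (e.g. offset not page aligned): fall back to pread() */
    }
#endif

    n = pread(fd, buf, len, offset);
    return (n == (ssize_t) len) ? 0 : -1;
}
```

On a platform without mmap(), or whenever the mapping fails, the function behaves like a plain pread() call, which is the "same position we're in now" case.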
+> There would be some nontrivial problems to solve, such as the
+> mechanics of accessing a large number of files from a large number
+> of backends without running out of virtual memory.  Also, is it
+> guaranteed that multiple backends mmap'ing the same block will
+> access the very same physical buffer, and not multiple copies?
+> Multiple copies would be fatal.  See the archives for more
+> discussion.
+
+Have read through the archives.  Making a call to madvise() will speed
+up access to the pages as it gives hints to the VM about what order
+the pages are accessed/used.  Here are a few bits from the BSD mmap()
+and madvise() man pages:
+
+mmap(2):
+     MAP_NOSYNC        Causes data dirtied via this VM map to be flushed to
+                       physical media only when necessary (usually by the
+                       pager) rather than gratuitously.  Typically this
+                       prevents the update daemons from flushing pages dirtied
+                       through such maps and thus allows efficient sharing of
+                       memory across unassociated processes using a
+                       file-backed shared memory map.  Without this option any
+                       VM pages you dirty may be flushed to disk every so often
+                       (every 30-60 seconds usually) which can create
+                       performance problems if you do not need that to occur
+                       (such as when you are using shared file-backed mmap
+                       regions for IPC purposes).  Note that VM/filesystem
+                       coherency is maintained whether you use MAP_NOSYNC or
+                       not.  This option is not portable across UNIX platforms
+                       (yet), though some may implement the same behavior by
+                       default.
+
+                       WARNING!  Extending a file with ftruncate(2), thus
+                       creating a big hole, and then filling the hole by
+                       modifying a shared mmap() can lead to severe file
+                       fragmentation.  In order to avoid such fragmentation
+                       you should always pre-allocate the file's backing store
+                       by write()ing zeros into the newly extended area prior
+                       to modifying the area via your mmap().  The
+                       fragmentation problem is especially sensitive to
+                       MAP_NOSYNC pages, because pages may be flushed to disk
+                       in a totally random order.
+
+                       The same applies when using MAP_NOSYNC to implement a
+                       file-based shared memory store.  It is recommended that
+                       you create the backing store by write()ing zeros to
+                       the backing file rather than ftruncate()ing it.  You
+                       can test file fragmentation by observing the KB/t
+                       (kilobytes per transfer) results from an ``iostat 1''
+                       while reading a large file sequentially, e.g. using
+                       ``dd if=filename of=/dev/null bs=32k''.
+
+                       The fsync(2) function will flush all dirty data and
+                       metadata associated with a file, including dirty NOSYNC
+                       VM data, to physical media.  The sync(8) command and
+                       sync(2) system call generally do not flush dirty NOSYNC
+                       VM data.  The msync(2) system call is obsolete since
+                       BSD implements a coherent filesystem buffer cache.
+                       However, it may be used to associate dirty VM pages
+                       with filesystem buffers and thus cause them to be
+                       flushed to physical media sooner rather than later.
+
+madvise(2):
+     MADV_NORMAL      Tells the system to revert to the default paging
+                      behavior.
+
+     MADV_RANDOM      Is a hint that pages will be accessed randomly, and
+                      prefetching is likely not advantageous.
+
+     MADV_SEQUENTIAL  Causes the VM system to depress the priority of pages
+                      immediately preceding a given page when it is faulted
+                      in.
+
+mprotect(2):
+     The mprotect() system call changes the specified pages to have protection
+     prot.  Not all implementations will guarantee protection on a page basis;
+     the granularity of protection changes may be as large as an entire
+     region.  A region is the virtual address space defined by the start and
+     end addresses of a struct vm_map_entry.
+
+     Currently these protection bits are known, which can be combined, OR'd
+     together:
+
+     PROT_NONE     No permissions at all.
+
+     PROT_READ     The pages can be read.
+
+     PROT_WRITE    The pages can be written.
+
+     PROT_EXEC     The pages can be executed.
+
+msync(2):
+     The msync() system call writes any modified pages back to the filesystem
+     and updates the file modification time.  If len is 0, all modified pages
+     within the region containing addr will be flushed; if len is non-zero,
+     only those pages containing addr and len-1 succeeding locations will be
+     examined.  The flags argument may be specified as follows:
+
+     MS_ASYNC        Return immediately
+     MS_SYNC         Perform synchronous writes
+     MS_INVALIDATE   Invalidate all cached data
+
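The man page advice above (pre-allocate the backing store with write(), map it shared, flush with msync()) could be sketched roughly as follows. The function names are hypothetical, and MAP_NOSYNC is treated as a no-op where the platform does not define it, since the excerpt itself notes that the flag is not portable:

```c
#define _DEFAULT_SOURCE         /* expose pread() and mmap() on glibc */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* MAP_NOSYNC is a BSD extension; assume "no flag" on other systems. */
#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0
#endif

/*
 * create_store (hypothetical name): create path and fill it with size zero
 * bytes using write(), so the file is written out up front instead of being
 * left as a hole by ftruncate().  Returns an open fd, or -1 on failure.
 */
int
create_store(const char *path, size_t size)
{
    char        zeros[8192];
    int         fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    memset(zeros, 0, sizeof(zeros));
    while (size > 0)
    {
        size_t      chunk = size < sizeof(zeros) ? size : sizeof(zeros);
        ssize_t     n = write(fd, zeros, chunk);

        if (n <= 0)
        {
            close(fd);
            return -1;
        }
        size -= (size_t) n;
    }
    return fd;
}

/* map_store: shared, writable mapping of the first size bytes of fd. */
void *
map_store(int fd, size_t size)
{
    void       *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_NOSYNC, fd, 0);

    return (p == MAP_FAILED) ? NULL : p;
}

/* flush_store: synchronously write modified pages back; 0 on success. */
int
flush_store(void *p, size_t size)
{
    return msync(p, size, MS_SYNC);
}
```

Whether writing zeros really avoids fragmentation depends on the filesystem; the sketch only follows the recommendation quoted from the BSD man page.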
+
+A few thoughts come to mind:
+
+1) backends could share buffers by mmap()'ing shared regions of data.
+   While I haven't seen any numbers to reflect this, I'd wager that
+   mmap() is a faster interface than ipc.
+
+2) It looks like while there are various file IO schemes scattered all
+   over the place, the bulk of the critical routines that would need
+   to be updated are in backend/storage/file/fd.c, more specifically:
+
+   *) fileNameOpenFile() would need the appropriate mmap() call made
+      to it.
+
+   *) FileTruncate() would need some attention to avoid fragmentation.
+
+   *) a new "sync" GUC would have to be introduced to handle msync
+      (affects only pg_fsync() and pg_fdatasync()).
+
+3) There's a bit of code in pgsql/src/backend/storage/smgr that could
+   be gutted/removed.  Which of those storage types are even used any
+   more?  There's a reference in the code to PostgreSQL 3.0.  :)
+
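Point 1 and the earlier question about multiple backends seeing the same buffer can be checked on a given platform with a small experiment. This is a sketch under assumptions: child_write_visible() is a hypothetical name, the two "backends" are simply a parent and a forked child, and each one opens and maps the file on its own with MAP_SHARED. It shows what one system does and is not a proof for every platform PostgreSQL supports:

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * child_write_visible (hypothetical name): the child opens and maps path on
 * its own and stores value in the first byte; the parent then reads that
 * byte through its own, separately created mapping.
 *
 * Returns 1 if the parent sees the child's write, 0 if it does not, and -1
 * on error.  path must exist and be at least one byte long.
 */
int
child_write_visible(const char *path, unsigned char value)
{
    int         status;
    int         result = -1;
    pid_t       pid;
    unsigned char *mine;
    int         fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    mine = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mine == MAP_FAILED)
    {
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid == 0)
    {
        /* Child: independent open() and mmap(), as another backend would. */
        unsigned char *theirs;
        int         cfd = open(path, O_RDWR);

        if (cfd < 0)
            _exit(1);
        theirs = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, cfd, 0);
        if (theirs == MAP_FAILED)
            _exit(1);
        theirs[0] = value;
        _exit(0);
    }

    /* Parent: wait for the child, then look through our own mapping. */
    if (pid > 0 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = (mine[0] == value);

    munmap(mine, 1);
    close(fd);
    return result;
}
```

A result of 1 means both mappings are backed by the same page on that system, which is the behavior the buffer-sharing idea relies on.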
+And I think that'd be it.  The LRU code could be used if necessary to
+help manage the amount of mmap()'ed data in the VM at any one time; at
+the very least, that could be handled by a shm var that various
+backends would increment/decrement as files are open()'ed/close()'ed.
+
+I didn't spend too long looking at this, but I _think_ that'd cover
+80% of PostgreSQL's disk access needs.  The next bit to possibly add
+would be passing a flag on FileOpen operations that'd act as a hint to
+madvise(), so that the VM could proactively react to PostgreSQL's
+needs.
+
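The madvise() hint idea might look something like the sketch below. AccessHint, hint_to_advice() and map_with_hint() are all hypothetical names invented for illustration; nothing like them is claimed to exist in fd.c. The advice values are the three quoted from the madvise(2) excerpt earlier in this message:

```c
#define _DEFAULT_SOURCE         /* expose madvise() on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical access-pattern hint a caller might pass when opening a file. */
typedef enum AccessHint
{
    ACCESS_DEFAULT,
    ACCESS_RANDOM,
    ACCESS_SEQUENTIAL
} AccessHint;

/* Translate the hint into the matching madvise() advice value. */
int
hint_to_advice(AccessHint hint)
{
    switch (hint)
    {
        case ACCESS_RANDOM:
            return MADV_RANDOM;
        case ACCESS_SEQUENTIAL:
            return MADV_SEQUENTIAL;
        default:
            return MADV_NORMAL;
    }
}

/*
 * map_with_hint: map the first len bytes of fd read-only and pass the hint
 * on to the VM.  The advice is best effort: a failing madvise() is ignored
 * because the mapping is still usable.  Returns NULL if mmap() fails.
 */
void *
map_with_hint(int fd, size_t len, AccessHint hint)
{
    void       *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    if (p == MAP_FAILED)
        return NULL;
    (void) madvise(p, len, hint_to_advice(hint));
    return p;
}
```

A sequential scan would pass ACCESS_SEQUENTIAL and an index lookup ACCESS_RANDOM; how much either helps is up to the platform's VM.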
+I don't have my copy of Stevens handy (it's some 700mi away atm,
+otherwise I'd cite it), but if Tom or someone else has it handy, look
+up the example re: the performance gain from read()'ing an mmap()'ed
+file versus a non-mmap()'ed file.  The difference is non-trivial and
+_WELL_ worth the time given the speed increase.  The same speed
+benefit held true for writes as well, iirc.  It's been a while, but I
+think it was around page 330.  The index has it listed and it's not
+that hard of an example to find.  -sc
+
+-- 
+Sean Chittenden
+
+--HjNkcEWJ4DMx36DP
+Content-Type: application/pgp-signature
+Content-Disposition: inline
+
+-----BEGIN PGP SIGNATURE-----
+Comment: Sean Chittenden <sean@chittenden.org>
+
+iD8DBQE+Z+mY3ZnjH7yEs0ERAjVkAJwMI1V7+HvMAA5ODadD5znsekI8TQCgvH0C
+KwvG7YLsJ+xpsTUS67KD+4M=
+=w8/7
+-----END PGP SIGNATURE-----
+
+--HjNkcEWJ4DMx36DP--
+