@@ -1763,3 +1763,254 @@ message can get through to the mailing list cleanly
+ From pgsql-committers-owner+M9273=maillist=candle.pha.pa.us@postgresql.org Thu Mar 6 19:37:25 2003
+ Return-path: <pgsql-committers-owner+M9273=maillist=candle.pha.pa.us@postgresql.org>
+ Received: from relay2.pgsql.com (relay2.pgsql.com [64.49.215.143])
+     by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id h270bM624923
+     for <maillist@candle.pha.pa.us>; Thu, 6 Mar 2003 19:37:24 -0500 (EST)
+ Received: from postgresql.org (postgresql.org [64.49.215.8])
+     by relay2.pgsql.com (Postfix) with ESMTP id 4D5CDEE0411
+     for <maillist@candle.pha.pa.us>; Thu, 6 Mar 2003 19:37:23 -0500 (EST)
+ X-Original-To: pgsql-committers@postgresql.org
+ Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251])
+     by postgresql.org (Postfix) with ESMTP
+     id 3120E47646F; Thu, 6 Mar 2003 19:36:58 -0500 (EST)
+ Received: by perrin.int.nxad.com (Postfix, from userid 1001)
+     id 9CBE42105B; Thu, 6 Mar 2003 16:36:40 -0800 (PST)
+ Date: Thu, 6 Mar 2003 16:36:40 -0800
+ From: Sean Chittenden <sean@chittenden.org>
+ To: Tom Lane <tgl@sss.pgh.pa.us>
+ cc: Christopher Kings-Lynne <chriskl@familyhealth.com.au>,
+     pgsql-committers@postgresql.org, pgsql-performance@postgresql.org
+ Subject: Re: [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ...
+ Message-ID: <20030307003640.GF79234@perrin.int.nxad.com>
+ References: <20030306031656.1876F4762E0@postgresql.org> <032f01c2e390$b1842b20$6500a8c0@fhp.internal> <11077.1046921667@sss.pgh.pa.us> <033f01c2e392$71476570$6500a8c0@fhp.internal> <12228.1046922471@sss.pgh.pa.us> <20030306094117.GA79234@perrin.int.nxad.com> <15071.1046964336@sss.pgh.pa.us>
+ MIME-Version: 1.0
+ Content-Type: multipart/signed; micalg=pgp-sha1;
+     protocol="application/pgp-signature"; boundary="HjNkcEWJ4DMx36DP"
+ Content-Disposition: inline
+ In-Reply-To: <15071.1046964336@sss.pgh.pa.us>
+ User-Agent: Mutt/1.4i
+ X-PGP-Key: finger seanc@FreeBSD.org
+ X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341
+ X-Web-Homepage: http://sean.chittenden.org/
+ Precedence: bulk
+ Sender: pgsql-committers-owner@postgresql.org
+ Status: OR
+
+ --HjNkcEWJ4DMx36DP
+ Content-Type: text/plain; charset=us-ascii
+ Content-Disposition: inline
+ Content-Transfer-Encoding: quoted-printable
+
+ [moving to -performance, please drop -committers from replies]
+
+ > > I've toyed with the idea of adding this because it is monstrously more
+ > > efficient than select()/poll() in basically every way, shape, and
+ > > form.
+ >
+ > From what I've looked at, kqueue only wins when you are watching a
+ > large number of file descriptors at the same time; which is an
+ > operation done nowhere in Postgres. I think the above would be a
+ > complete waste of effort.
+
+ It scales very well to many thousands of descriptors, but it also
+ works well on small numbers. kqueue is about 5x faster than
+ select() or poll() at the low end of the fd count. As I said
+ earlier, I don't think there is _much_ to gain in this regard. I
+ do think that it would be a speed improvement, but only on one OS
+ supported by PostgreSQL. I think that there are bigger speed
+ improvements to be had elsewhere in the code.
+
+ > > Is this one of the areas of PostgreSQL that just needs to get
+ > > slowly migrated to use mmap() or are there any gaping reasons why
+ > > to not use the family of system calls?
+ >
+ > There has been much speculation on this, and no proof that it
+ > actually buys us anything to justify the portability hit.
+
+ Actually, I think that it wouldn't be that big of a portability hit
+ because you would still read() and write() as always, but in
+ performance-sensitive areas, an #ifdef HAVE_MMAP section would have
+ the appropriate mmap() calls. If the system doesn't have mmap(),
+ there isn't much to lose and we're in the same position we're in now.
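A minimal sketch of the #ifdef HAVE_MMAP idea described above, not PostgreSQL code. The helper name read_block() is hypothetical, and HAVE_MMAP is assumed to be set by configure. When mmap() is unavailable or fails, the helper falls back to the usual lseek()/read() path, so callers see the same behavior either way.

```c
#define _DEFAULT_SOURCE         /* glibc: keep POSIX declarations visible under strict -std= modes */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef HAVE_MMAP
#include <sys/mman.h>
#endif

/*
 * Hypothetical helper: copy len bytes starting at offset from fd into buf.
 * Returns 0 on success, -1 on failure.
 */
static int
read_block(int fd, off_t offset, void *buf, size_t len)
{
    struct stat st;

    assert(buf != NULL && len > 0 && offset >= 0);

    /* Refuse ranges past EOF: touching such mapped pages raises SIGBUS. */
    if (fstat(fd, &st) != 0 || offset + (off_t) len > st.st_size)
        return -1;

#ifdef HAVE_MMAP
    {
        /* mmap() offsets must be page aligned, so map from the page start. */
        off_t   pagesize = (off_t) sysconf(_SC_PAGESIZE);
        off_t   base = offset - (offset % pagesize);
        size_t  skew = (size_t) (offset - base);
        void   *map = mmap(NULL, len + skew, PROT_READ, MAP_SHARED, fd, base);

        if (map != MAP_FAILED)
        {
            memcpy(buf, (char *) map + skew, len);
            munmap(map, len + skew);
            return 0;
        }
        perror("mmap (falling back to read)");
    }
#endif

    /* Portable path: the same lseek()/read() sequence used without mmap(). */
    if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
        return -1;
    while (len > 0)
    {
        ssize_t n = read(fd, buf, len);

        if (n <= 0)
            return -1;          /* error or unexpected EOF */
        buf = (char *) buf + n;
        len -= (size_t) n;
    }
    return 0;
}
```

Building with -DHAVE_MMAP selects the mapped path; building without it leaves only the read() path.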
+
+ > There would be some nontrivial problems to solve, such as the
+ > mechanics of accessing a large number of files from a large number
+ > of backends without running out of virtual memory. Also, is it
+ > guaranteed that multiple backends mmap'ing the same block will
+ > access the very same physical buffer, and not multiple copies?
+ > Multiple copies would be fatal. See the archives for more
+ > discussion.
+
+ I have read through the archives. Making a call to madvise() can speed
+ up access to the pages because it gives the VM hints about the order in
+ which the pages are accessed/used. Here are a few bits from the BSD
+ mmap(), madvise(), mprotect(), and msync() man pages:
+
+ mmap(2):
+     MAP_NOSYNC  Causes data dirtied via this VM map to be flushed to
+                 physical media only when necessary (usually by the
+                 pager) rather than gratuitously. Typically this
+                 prevents the update daemons from flushing pages dirtied
+                 through such maps and thus allows efficient sharing of
+                 memory across unassociated processes using a
+                 file-backed shared memory map. Without this option any VM
+                 pages you dirty may be flushed to disk every so often
+                 (every 30-60 seconds usually) which can create
+                 performance problems if you do not need that to occur (such
+                 as when you are using shared file-backed mmap regions
+                 for IPC purposes). Note that VM/filesystem coherency
+                 is maintained whether you use MAP_NOSYNC or not. This
+                 option is not portable across UNIX platforms (yet),
+                 though some may implement the same behavior by default.
+
+                 WARNING! Extending a file with ftruncate(2), thus
+                 creating a big hole, and then filling the hole by
+                 modifying a shared mmap() can lead to severe file
+                 fragmentation. In order to avoid such fragmentation you
+                 should always pre-allocate the file's backing store by
+                 write()ing zeros into the newly extended area prior to
+                 modifying the area via your mmap(). The fragmentation
+                 problem is especially sensitive to MAP_NOSYNC pages,
+                 because pages may be flushed to disk in a totally
+                 random order.
+
+                 The same applies when using MAP_NOSYNC to implement a
+                 file-based shared memory store. It is recommended that
+                 you create the backing store by write()ing zeros to
+                 the backing file rather than ftruncate()ing it. You
+                 can test file fragmentation by observing the KB/t
+                 (kilobytes per transfer) results from an ``iostat 1''
+                 while reading a large file sequentially, e.g. using
+                 ``dd if=filename of=/dev/null bs=32k''.
+
+                 The fsync(2) function will flush all dirty data and
+                 metadata associated with a file, including dirty NOSYNC
+                 VM data, to physical media. The sync(8) command and
+                 sync(2) system call generally do not flush dirty NOSYNC
+                 VM data. The msync(2) system call is obsolete since
+                 BSD implements a coherent filesystem buffer cache.
+                 However, it may be used to associate dirty VM pages
+                 with filesystem buffers and thus cause them to be
+                 flushed to physical media sooner rather than later.
+
+ madvise(2):
+     MADV_NORMAL      Tells the system to revert to the default paging
+                      behavior.
+
+     MADV_RANDOM      Is a hint that pages will be accessed randomly, and
+                      prefetching is likely not advantageous.
+
+     MADV_SEQUENTIAL  Causes the VM system to depress the priority of pages
+                      immediately preceding a given page when it is faulted
+                      in.
+
+ mprotect(2):
+     The mprotect() system call changes the specified pages to have
+     protection prot. Not all implementations will guarantee protection on
+     a page basis; the granularity of protection changes may be as large as
+     an entire region. A region is the virtual address space defined by the
+     start and end addresses of a struct vm_map_entry.
+
+     Currently these protection bits are known, which can be combined, OR'd
+     together:
+
+     PROT_NONE   No permissions at all.
+
+     PROT_READ   The pages can be read.
+
+     PROT_WRITE  The pages can be written.
+
+     PROT_EXEC   The pages can be executed.
+
+ msync(2):
+     The msync() system call writes any modified pages back to the
+     filesystem and updates the file modification time. If len is 0, all
+     modified pages within the region containing addr will be flushed; if
+     len is non-zero, only those pages containing addr and len-1 succeeding
+     locations will be examined. The flags argument may be specified as
+     follows:
+
+     MS_ASYNC       Return immediately
+     MS_SYNC        Perform synchronous writes
+     MS_INVALIDATE  Invalidate all cached data
+
+
+ A few thoughts come to mind:
+
+ 1) Backends could share buffers by mmap()'ing shared regions of data.
+    While I haven't seen any numbers to reflect this, I'd wager that
+    mmap() is a faster interface than IPC.
+
+ 2) It looks like, while there are various file IO schemes scattered all
+    over the place, the bulk of the critical routines that would need
+    to be updated are in backend/storage/file/fd.c, more specifically:
+
+    *) fileNameOpenFile() would need the appropriate mmap() call added
+       to it.
+
+    *) FileTruncate() would need some attention to avoid fragmentation.
+
+    *) a new "sync" GUC would have to be introduced to handle msync
+       (affects only pg_fsync() and pg_fdatasync()).
+
+ 3) There's a bit of code in pgsql/src/backend/storage/smgr that could
+    be gutted/removed. Which of those storage types are even used any
+    more? There's a reference in the code to PostgreSQL 3.0. :)
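A rough sketch of points 1 and 2 above, under stated assumptions: the helper names extend_with_zeros() and map_shared() are hypothetical and are not fd.c routines. The first grows a file by write()ing zeros, as the mmap(2) excerpt recommends, instead of leaving a hole with ftruncate(). The second maps the file MAP_SHARED and adds MAP_NOSYNC only where the platform defines it. It does not settle whether every platform gives all backends the same physical buffer.

```c
#define _DEFAULT_SOURCE         /* glibc: keep POSIX declarations visible under strict -std= modes */
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0            /* BSD flag; defined as 0 elsewhere so the call still compiles */
#endif

/*
 * Hypothetical helper: grow fd to new_size bytes by appending zeros, so the
 * backing store is allocated up front rather than left as a sparse hole.
 * Returns 0 on success, -1 on failure.
 */
static int
extend_with_zeros(int fd, off_t new_size)
{
    static const char zeros[8192];      /* static storage is zero-filled */
    off_t       cur = lseek(fd, 0, SEEK_END);

    assert(new_size >= 0);
    if (cur == (off_t) -1)
        return -1;

    while (cur < new_size)
    {
        size_t      chunk = sizeof(zeros);
        ssize_t     n;

        if ((off_t) chunk > new_size - cur)
            chunk = (size_t) (new_size - cur);
        n = write(fd, zeros, chunk);
        if (n <= 0)
            return -1;
        cur += n;
    }
    return 0;
}

/*
 * Hypothetical helper: map the first len bytes of fd read/write and shared.
 * Returns NULL on failure.
 */
static void *
map_shared(int fd, size_t len)
{
    void       *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NOSYNC, fd, 0);
    if (p == MAP_FAILED)
    {
        perror("mmap");
        return NULL;
    }
    return p;
}
```

A caller would run extend_with_zeros() before map_shared() so that every mapped page already has backing store. Stores made through one MAP_SHARED mapping are then visible through other mappings of the same file range.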
+
+ And I think that'd be it. The LRU code could be used if necessary to
+ help manage the amount of mmap()'ed data in the VM at any one time; at
+ the very least, that could be handled by a shm var that various backends
+ would increment/decrement as files are open()'ed/close()'ed.
+
+ I didn't spend too long looking at this, but I _think_ that'd cover
+ 80% of PostgreSQL's disk access needs. The next bit to possibly add
+ would be passing a flag on FileOpen operations that'd act as a hint to
+ madvise(), so that the VM could proactively react to PostgreSQL's
+ needs.
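One way the open-time hint could look, as a sketch only. The AccessHint enum and the hint_to_advice() and map_with_hint() helpers are hypothetical names, not existing fd.c interfaces. The MADV_* constants are the ones quoted from madvise(2) above.

```c
#define _DEFAULT_SOURCE         /* glibc: expose madvise() under strict -std= modes */
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical access-pattern flag a caller might pass when opening a file. */
typedef enum
{
    ACCESS_NORMAL,
    ACCESS_RANDOM,
    ACCESS_SEQUENTIAL
} AccessHint;

/* Translate the caller's hint into the matching madvise() advice value. */
static int
hint_to_advice(AccessHint hint)
{
    switch (hint)
    {
        case ACCESS_RANDOM:
            return MADV_RANDOM;
        case ACCESS_SEQUENTIAL:
            return MADV_SEQUENTIAL;
        default:
            return MADV_NORMAL;
    }
}

/*
 * Hypothetical helper: map the first len bytes of fd read-only and tell the
 * VM how the pages are expected to be accessed.  Returns NULL if mmap() fails.
 */
static void *
map_with_hint(int fd, size_t len, AccessHint hint)
{
    void       *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* madvise() is advisory: a failure costs performance, not correctness. */
    if (madvise(p, len, hint_to_advice(hint)) != 0)
        perror("madvise");
    return p;
}
```

A sequential scan might pass ACCESS_SEQUENTIAL and an index lookup ACCESS_RANDOM. Whether either hint helps in practice would have to be measured on each platform.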
+
+ I don't have my copy of Stevens handy (it's some 700mi away atm,
+ otherwise I'd cite it), but if Tom or someone else has it handy, look
+ up the example re: the performance gain from reading an mmap()'ed
+ file versus a non-mmap()'ed file. The difference is non-trivial and
+ _WELL_ worth the time given the speed increase. The same speed
+ benefit held true for writes as well, iirc. It's been a while, but I
+ think it was around page 330. The index has it listed and it's not
+ that hard of an example to find. -sc
+
+ -- 
+ Sean Chittenden
+
+ --HjNkcEWJ4DMx36DP
+ Content-Type: application/pgp-signature
+ Content-Disposition: inline
+
+ -----BEGIN PGP SIGNATURE-----
+ Comment: Sean Chittenden <sean@chittenden.org>
+
+ iD8DBQE+Z+mY3ZnjH7yEs0ERAjVkAJwMI1V7+HvMAA5ODadD5znsekI8TQCgvH0C
+ KwvG7YLsJ+xpsTUS67KD+4M=
+ =w8/7
+ -----END PGP SIGNATURE-----
+
+ --HjNkcEWJ4DMx36DP--
+