<!-- doc/src/sgml/cfs.sgml -->

<chapter id="cfs">
 <title>Compressed file system</title>

 <para>
  This chapter explains page-level compression and encryption in the
  <productname>PostgreSQL</> database system.
 </para>

 <sect1 id="cfs-overview">
  <title>Why database compression/encryption may be useful</title>

  <para>
   Databases are used to store large amounts of text and duplicated information. This is why compression of most databases
   can be quite efficient and reduce the used storage size by a factor of 3 to 5. Postgres compresses TOAST data, but small
   text fields which fit in a page are not compressed. Moreover, not only heap pages can be compressed: indexes on text keys
   or indexes with a large number of duplicate values are also good candidates for compression.
  </para>

  <para>
   Postgres works with disk data through a buffer pool which caches the most frequently used pages.
   The interface between the buffer manager and the file system is the most natural place to perform compression.
   Buffers are stored on disk in compressed form, reducing disk usage and minimizing the amount of data to be read,
   while the in-memory buffer pool contains uncompressed buffers, providing access to records at the same speed as without
   compression. Since modern servers have a large amount of RAM, a substantial part of the database can be cached in
   memory and accessed without any compression overhead penalty.
  </para>

  <para>
   Besides the obvious advantage of saving disk space, compression can also improve system performance.
   There are two main reasons for this:
  </para>

  <variablelist>
   <varlistentry>
    <term>Reducing the amount of disk IO</term>
    <listitem>
     <para>
      Compression helps to reduce the amount of data which has to be written to or read from the disk.
      A compression ratio of 3 means that you need to read 3 times less data, or that the same number of records can be fetched
      3 times faster.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>Improving locality</term>
    <listitem>
     <para>
      When modified buffers are flushed from the buffer pool to the disk, they are written to random locations
      on the disk. The Postgres cache replacement algorithm decides which buffer to evict from the pool
      based on its access frequency, ignoring its location on the disk. So two subsequently written buffers can be
      located in completely different parts of the disk. For an HDD the seek time is quite large - about 10 msec, which corresponds
      to 100 random writes per second. The speed of sequential writes can be about 100MB/sec, which corresponds to
      10000 buffers per second (100 times faster). For an SSD the gap between sequential and random write speed is smaller,
      but sequential writes are still more efficient. How does this relate to data compression?
      The size of a buffer in PostgreSQL is fixed (8kb by default). The size of a compressed buffer depends on the content of the buffer,
      so an updated buffer cannot always fit in its old location on the disk. This is why we cannot access pages directly
      by their addresses. Instead, we have to use a map which translates the logical address of a page to its physical location
      on the disk. This extra level of indirection certainly adds overhead, but in most cases the map fits in memory,
      so a page lookup is nothing more than accessing an array element. The presence of this map also has a positive effect:
      we can now write updated pages sequentially, just updating their map entries.
      Postgres does a lot to avoid a "write storm": intensive flushing of data to the disk when buffer pool space is
      exhausted. Compression allows the disk load to be reduced significantly.
     </para>
    </listitem>
   </varlistentry>
  </variablelist>

  <para>
   Another useful feature which can be combined with compression is database encryption.
   Encryption protects your database from unintended access: if somebody steals your notebook or hard drive, or makes
   a copy of it, the thief will not be able to extract information from your database if it is encrypted.
   Postgres provides the contrib module pgcrypto, allowing you to encrypt particular types/columns.
   But a safer and more convenient way is to encrypt all data in the database. Encryption can be combined with compression:
   data is stored on the disk in encrypted form and decrypted when a page is loaded into the buffer pool.
   It is essential that compression is performed before encryption, because otherwise encryption eliminates regularities in the
   data and the compression ratio will be close to 1.
  </para>

  <para>
   Why do we need to perform compression/encryption in Postgres instead of using the corresponding features of the underlying file
   system? The first reason is that there are not many file systems supporting compression and encryption on all OSes.
   And even if such a file system is available, it is not always possible or convenient to install it just
   to compress/protect your database. The second reason is that performing compression at the database level can be more
   efficient, because there we can use knowledge about the size of a database page and compress accordingly.
  </para>

 </sect1>

 <sect1 id="cfs-implementation">
  <title>How compression/encryption are integrated in Postgres</title>

  <para>
   To improve the efficiency of disk IO, Postgres works with files through the buffer manager, which pins
   the most frequently used pages in memory. Each page has a fixed size (8kb by default). But if we compress a page,
   its size will depend on its content, so an updated page can require more (or less) space than the original page.
   Hence we cannot always perform an in-place update of a page. Instead, we have to locate new space for the page and somehow release
   the old space. There are two main approaches to solving this problem:
  </para>

  <variablelist>
   <varlistentry>
    <term>Memory allocator</term>
    <listitem>
     <para>
      We could implement our own allocator of file space. Usually, to reduce fragmentation, a fixed-size block allocator is used.
      It means that we allocate space using some fixed quantum. For example, if the compressed page size is 932 bytes, we
      allocate a 1024-byte block for it in the file (see the illustration just after this list).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>Garbage collector</term>
    <listitem>
     <para>
      We can always allocate space for pages sequentially at the end of the file and periodically perform
      compactification (defragmentation) of the file, moving all used pages to the beginning of the file.
      Such a garbage collection process can be performed in the background.
      As explained in the previous section, sequential writing of the flushed pages can significantly
      increase IO speed and thus improve performance. This is why we have used this approach in CFS.
     </para>
    </listitem>
   </varlistentry>
  </variablelist>
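
  <para>
   To illustrate the quantum rounding used by the first approach, the allocated block size for a
   932-byte compressed page and a hypothetical 1024-byte quantum can be computed directly in SQL:
  </para>

<programlisting>
postgres=# select ceil(932 / 1024.0) * 1024 as allocated_bytes;
 allocated_bytes
-----------------
            1024
</programlisting>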

  <para>
   As the page location is not fixed and a page can be moved, we can no longer access a page directly by its address and need
   an extra level of indirection to map the logical address of the page to its physical location on the disk.
   This is done using memory-mapped files. In most cases the map is kept entirely in memory (the map is roughly 1000 times
   smaller than the data file: for example, a 2GB relation segment contains 262144 8kb pages, so with 8-byte entries its map
   occupies about 2MB), and address translation adds almost no overhead to the page access time.
   But we need to maintain these extra files: flush them during checkpoints, remove them when the table is dropped, include them
   in backups, and so on...
  </para>

  <para>
   Postgres stores a relation in a set of files, with the size of each file not exceeding 2GB. A separate page map is constructed for each file.
   Garbage collection in CFS is done by several background workers. The number of these workers and the pauses in their work can be
   configured by the database administrator. The workers split the work based on the inode hash, so they do not conflict with each other.
   Each file is processed separately. The file is blocked for access at the time of garbage collection, but the complete relation is not
   blocked. To ensure data consistency, GC creates copies of the original data and map files. Once they are flushed to the disk, the
   new version of the data file is atomically renamed to the original file name. Then the new page map data is copied to the memory-mapped file
   and the backup file for the page map is removed. During recovery after a crash, we first check whether there is a backup of the data file.
   If such a file exists, then the original file has not been updated yet and we can safely remove the backup files. If it doesn't exist,
   then we check for the presence of a map file backup. If it is present, then defragmentation of this file was not completed
   because of the crash, and we complete the operation by copying the map from the backup file.
  </para>

  <para>
   CFS can be built with several compression libraries: Postgres' built-in LZ, zlib, lz4, snappy, lzfse...
   But this is a build-time choice: it is currently not possible to choose the compression algorithm dynamically.
   CFS stores information about the used compression algorithm in the tablespace and produces an error if Postgres is built with a different
   library.
  </para>

  <para>
   Encryption is performed using the RC4 algorithm. The cipher key is obtained from the <varname>PG_CIPHER_KEY</varname> environment variable.
   Please notice that catalog relations are not encrypted, and neither are the non-main forks of a relation.
  </para>

 </sect1>

 <sect1 id="cfs-usage">
  <title>Using compression/encryption</title>

  <para>
   Compression can be enabled for particular tablespaces. System relations are never compressed.
   It is currently not possible to alter the compression option of a tablespace, i.e. it is not possible to compress an existing
   tablespace or, vice versa, decompress a compressed one.
  </para>

  <para>
   So to use compression/encryption you need to create a tablespace with the <varname>compression=true</varname> option.
   You can make it the default tablespace - in this case all tables will be implicitly created in it:
  </para>

<programlisting>
postgres=# create tablespace zfs location '/var/data/cfs' with (compression=true);
postgres=# set default_tablespace=zfs;
</programlisting>
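
  <para>
   Alternatively, individual tables and indexes can be placed in the compressed tablespace explicitly
   using the standard TABLESPACE clause (the table and column names below are purely illustrative):
  </para>

<programlisting>
postgres=# create table messages(id serial, body text) tablespace zfs;
postgres=# create index on messages(body) tablespace zfs;
</programlisting>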

  <para>
   Encryption currently can only be combined with compression: it is not possible to use encryption without compression.
   To enable encryption you should set the <varname>cfs_encryption</varname> parameter to true and provide the cipher key by setting
   the <varname>PG_CIPHER_KEY</varname> environment variable.
  </para>
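
  <para>
   For example (a sketch: it assumes <varname>cfs_encryption</varname> can be changed with ALTER SYSTEM like an
   ordinary configuration parameter, and that the server is then restarted with <varname>PG_CIPHER_KEY</varname>
   present in its environment):
  </para>

<programlisting>
postgres=# alter system set cfs_encryption = true;
</programlisting>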

  <para>
   CFS provides the following configuration parameters:
  </para>

  <variablelist>

   <varlistentry id="cfs-encryption" xreflabel="cfs_encryption">
    <term><varname>cfs_encryption</varname> (<type>boolean</type>)
     <indexterm>
      <primary><varname>cfs_encryption</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      Enables encryption of compressed pages. Switched off by default.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry id="cfs-gc-workers" xreflabel="cfs_gc_workers">
    <term><varname>cfs_gc_workers</varname> (<type>integer</type>)
     <indexterm>
      <primary><varname>cfs_gc_workers</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      Number of CFS background garbage collection workers (default: 1).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry id="cfs-gc-threshold" xreflabel="cfs_gc_threshold">
    <term><varname>cfs_gc_threshold</varname> (<type>integer</type>)
     <indexterm>
      <primary><varname>cfs_gc_threshold</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      Percentage of garbage in a file after which the file should be compactified (default: 50%).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry id="cfs-gc-period" xreflabel="cfs_gc_period">
    <term><varname>cfs_gc_period</varname> (<type>integer</type>)
     <indexterm>
      <primary><varname>cfs_gc_period</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      Interval in milliseconds between CFS garbage collection iterations (default: 5000, i.e. 5 seconds).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry id="cfs-gc-delay" xreflabel="cfs_gc_delay">
    <term><varname>cfs_gc_delay</varname> (<type>integer</type>)
     <indexterm>
      <primary><varname>cfs_gc_delay</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      Delay in milliseconds between the defragmentation of individual files (default: 0).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry id="cfs-level" xreflabel="cfs_level">
    <term><varname>cfs_level</varname> (<type>integer</type>)
     <indexterm>
      <primary><varname>cfs_level</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      CFS compression level (default: 1). 0 means no compression, 1 is the fastest compression.
      The maximal compression level depends on the particular compression algorithm: 9 for zlib, 19 for zstd...
     </para>
    </listitem>
   </varlistentry>

  </variablelist>

  <para>
   By default CFS is configured with one background worker performing garbage collection.
   The garbage collector traverses the tablespace directory, locating the map files in it and checking the percentage of garbage in each file.
   When it exceeds the <varname>cfs_gc_threshold</> threshold, the file is defragmented.
   The file is locked for the period of defragmentation, preventing any access to this part of the relation.
   When defragmentation is completed, the garbage collector waits <varname>cfs_gc_delay</varname> milliseconds and continues the directory traversal.
   After the traversal is finished, GC waits <varname>cfs_gc_period</varname> milliseconds and starts a new GC iteration.
   If there is more than one GC worker, they split the work based on the hash of the file inode.
  </para>
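
  <para>
   For example, GC behavior could be tuned as follows (a sketch: it assumes these parameters can be changed
   with ALTER SYSTEM; changing the number of background workers may require a server restart to take effect):
  </para>

<programlisting>
postgres=# alter system set cfs_gc_workers = 2;
postgres=# alter system set cfs_gc_threshold = 40;
postgres=# alter system set cfs_gc_period = 10000;
</programlisting>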

  <para>
   It is also possible to initiate GC manually using the <varname>cfs_start_gc(n_workers)</varname> function.
   This function returns the number of workers which have actually been started. Please notice that if the <varname>cfs_gc_workers</varname>
   parameter is non-zero, then GC is performed in the background, and the <varname>cfs_start_gc</varname> function does nothing and returns 0.
  </para>
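
  <para>
   For example, with background GC disabled (<varname>cfs_gc_workers</varname> set to 0), four GC workers
   could be started manually (the output shown is illustrative):
  </para>

<programlisting>
postgres=# select cfs_start_gc(4);
 cfs_start_gc
--------------
            4
</programlisting>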

  <para>
   It is possible to estimate the effect of table compression using the <varname>cfs_estimate(relation)</varname> function.
   This function takes the first ten blocks of the relation, tries to compress them, and returns the average compression ratio.
   So if the returned value is 7.8, the compressed table will occupy about eight times less space than the original table.
  </para>
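
  <para>
   For example (the relation name and the returned value are purely illustrative):
  </para>

<programlisting>
postgres=# select cfs_estimate('my_table');
 cfs_estimate
--------------
          7.8
</programlisting>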

 </sect1>
</chapter>