<!-- doc/src/sgml/cfs.sgml -->

<chapter id="cfs">
 <title>Compressed file system</title>

 <para>
  This chapter explains page-level compression and encryption in the
  <productname>PostgreSQL</> database system.
 </para>

 <sect1 id="cfs-overview">
  <title>Why database compression/encryption may be useful</title>

  <para>
   Databases are used to store large amounts of text and duplicated information. This is why compression of most databases
   can be quite efficient and reduce the used storage size by a factor of 3 to 5. Postgres compresses TOAST data, but small
   text fields which fit in a page are not compressed. Moreover, not only heap pages can be compressed: indexes on text keys
   or indexes with a large number of duplicate values are also good candidates for compression.
  </para>

  <para>
   Postgres works with disk data through a buffer pool which caches the most frequently used pages.
   The interface between the buffer manager and the file system is the most natural place to perform compression.
   Buffers are stored on disk in compressed form, reducing disk usage and minimizing the amount of data to be read,
   while the in-memory buffer pool contains uncompressed buffers, providing access to records at the same speed as without
   compression. Since modern servers have a large amount of RAM, a substantial part of the database can be cached in
   memory and accessed without any compression overhead penalty.
  </para>

  <para>
   Besides the obvious advantage of saving disk space, compression can also improve system performance.
   There are two main reasons for this:
  </para>

  <variablelist>
   <varlistentry>
    <term>Reducing the amount of disk IO</term>
    <listitem>
     <para>
      Compression helps to reduce the amount of data which has to be written to or read from the disk.
      A compression ratio of 3 means that you need to read 3 times less data, or that the same number of records can be fetched
      3 times faster.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>Improving locality</term>
    <listitem>
     <para>
      When modified buffers are flushed from the buffer pool to the disk, they are written to random locations
      on the disk. The Postgres cache replacement algorithm decides which buffer to evict from the pool
      based on its access frequency, ignoring its location on the disk. So two subsequently written buffers can be
      located in completely different parts of the disk. For an HDD the seek time is quite large - about 10 msec, which corresponds
      to 100 random writes per second. The speed of sequential writes can be about 100MB/sec, which corresponds to
      10000 buffers per second (100 times faster). For an SSD the gap between sequential and random write speed is smaller,
      but sequential writes are still more efficient. How does this relate to data compression?
      The size of a buffer in PostgreSQL is fixed (8kb by default). The size of a compressed buffer depends on the content of the buffer,
      so an updated buffer cannot always fit in its old location on the disk. This is why we cannot access pages directly
      by their addresses. Instead, we have to use a map which translates the logical address of a page to its physical location
      on the disk. This extra level of indirection certainly adds overhead, but in most cases the map fits in memory,
      so a page lookup is nothing more than accessing an array element. The presence of this map also has a positive effect:
      we can now write updated pages sequentially, just updating their map entries.
      Postgres does a lot to avoid a "write storm": intensive flushing of data to the disk when buffer pool space is
      exhausted. Compression allows the disk load to be reduced significantly.
     </para>
    </listitem>
   </varlistentry>
  </variablelist>

  <para>
   Another useful feature which can be combined with compression is database encryption.
   Encryption protects your database from unintended access: if somebody steals your notebook or hard drive, or makes
   a copy of it, the thief will not be able to extract information from your database if it is encrypted.
   Postgres provides the contrib module pgcrypto, allowing you to encrypt particular types/columns.
   But a safer and more convenient way is to encrypt all data in the database. Encryption can be combined with compression:
   data is stored on the disk in encrypted form and decrypted when a page is loaded into the buffer pool.
   It is essential that compression is performed before encryption, because otherwise encryption eliminates regularities in the
   data and the compression ratio will be close to 1.
  </para>

  <para>
   Why do we need to perform compression/encryption in Postgres instead of using the corresponding features of the underlying file
   system? The first reason is that there are not many file systems supporting compression and encryption on all OSes.
   And even if such a file system is available, it is not always possible or convenient to install it just
   to compress/protect your database. The second reason is that performing compression at the database level can be more
   efficient, because there we can use knowledge about the size of a database page and compress accordingly.
  </para>

 </sect1>

 <sect1 id="cfs-implementation">
  <title>How compression/encryption are integrated in Postgres</title>

  <para>
   To improve the efficiency of disk IO, Postgres works with files through the buffer manager, which pins
   the most frequently used pages in memory. Each page has a fixed size (8kb by default). But if we compress a page,
   its size will depend on its content, so an updated page can require more (or less) space than the original page.
   Hence we cannot always perform an in-place update of a page. Instead, we have to locate new space for the page and somehow release
   the old space. There are two main approaches to solving this problem:
  </para>

  <variablelist>
   <varlistentry>
    <term>Memory allocator</term>
    <listitem>
     <para>
      We could implement our own allocator of file space. Usually, to reduce fragmentation, a fixed-size block allocator is used.
      It means that we allocate space using some fixed quantum. For example, if the compressed page size is 932 bytes, we
      allocate a 1024-byte block for it in the file (see the illustration just after this list).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>Garbage collector</term>
    <listitem>
     <para>
      We can always allocate space for pages sequentially at the end of the file and periodically perform
      compactification (defragmentation) of the file, moving all used pages to the beginning of the file.
      Such a garbage collection process can be performed in the background.
      As explained in the previous section, sequential writing of the flushed pages can significantly
      increase IO speed and thus improve performance. This is why we have used this approach in CFS.
     </para>
    </listitem>
   </varlistentry>
  </variablelist>
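
  <para>
   To illustrate the quantum rounding used by the first approach, the allocated block size for a
   932-byte compressed page and a hypothetical 1024-byte quantum can be computed directly in SQL:
  </para>

<programlisting>
postgres=# select ceil(932 / 1024.0) * 1024 as allocated_bytes;
 allocated_bytes
-----------------
            1024
</programlisting>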

  <para>
   As the page location is not fixed and a page can be moved, we can no longer access a page directly by its address and need
   an extra level of indirection to map the logical address of the page to its physical location on the disk.
   This is done using memory-mapped files. In most cases the map is kept entirely in memory (the map is roughly 1000 times
   smaller than the data file: for example, a 2GB relation segment contains 262144 8kb pages, so with 8-byte entries its map
   occupies about 2MB), and address translation adds almost no overhead to the page access time.
   But we need to maintain these extra files: flush them during checkpoints, remove them when the table is dropped, include them
   in backups, and so on...
  </para>

  <para>
   Postgres stores a relation in a set of files, with the size of each file not exceeding 2GB. A separate page map is constructed for each file.
   Garbage collection in CFS is done by several background workers. The number of these workers and the pauses in their work can be
   configured by the database administrator. The workers split the work based on the inode hash, so they do not conflict with each other.
   Each file is processed separately. The file is blocked for access at the time of garbage collection, but the complete relation is not
   blocked. To ensure data consistency, GC creates copies of the original data and map files. Once they are flushed to the disk, the
   new version of the data file is atomically renamed to the original file name. Then the new page map data is copied to the memory-mapped file
   and the backup file for the page map is removed. During recovery after a crash, we first check whether there is a backup of the data file.
   If such a file exists, then the original file has not been updated yet and we can safely remove the backup files. If it doesn't exist,
   then we check for the presence of a map file backup. If it is present, then defragmentation of this file was not completed
   because of the crash, and we complete the operation by copying the map from the backup file.
  </para>

  <para>
   CFS can be built with several compression libraries: Postgres' built-in LZ, zlib, lz4, snappy, lzfse...
   But this is a build-time choice: it is currently not possible to choose the compression algorithm dynamically.
   CFS stores information about the used compression algorithm in the tablespace and produces an error if Postgres is built with a different
   library.
  </para>

  <para>
   Encryption is performed using the RC4 algorithm. The cipher key is obtained from the <varname>PG_CIPHER_KEY</varname> environment variable.
   Please notice that catalog relations are not encrypted, and neither are the non-main forks of a relation.
  </para>

 </sect1>

 <sect1 id="cfs-usage">
  <title>Using compression/encryption</title>

  <para>
   Compression can be enabled for particular tablespaces. System relations are never compressed.
   It is currently not possible to alter the compression option of a tablespace, i.e. it is not possible to compress an existing
   tablespace or, vice versa, decompress a compressed one.
  </para>

  <para>
   So to use compression/encryption you need to create a tablespace with the <varname>compression=true</varname> option.
   You can make it the default tablespace - in this case all tables will be implicitly created in it:
  </para>

<programlisting>
postgres=# create tablespace zfs location '/var/data/cfs' with (compression=true);
postgres=# set default_tablespace=zfs;
</programlisting>
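
  <para>
   Alternatively, individual tables and indexes can be placed in the compressed tablespace explicitly
   using the standard TABLESPACE clause (the table and column names below are purely illustrative):
  </para>

<programlisting>
postgres=# create table messages(id serial, body text) tablespace zfs;
postgres=# create index on messages(body) tablespace zfs;
</programlisting>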

  <para>
   Encryption currently can only be combined with compression: it is not possible to use encryption without compression.
   To enable encryption you should set the <varname>cfs_encryption</varname> parameter to true and provide the cipher key by setting
   the <varname>PG_CIPHER_KEY</varname> environment variable.
  </para>
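
  <para>
   For example (a sketch: it assumes <varname>cfs_encryption</varname> can be changed with ALTER SYSTEM like an
   ordinary configuration parameter, and that the server is then restarted with <varname>PG_CIPHER_KEY</varname>
   present in its environment):
  </para>

<programlisting>
postgres=# alter system set cfs_encryption = true;
</programlisting>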

  <para>
   CFS provides the following configuration parameters:
  </para>

  <variablelist>

   <varlistentry id="cfs-encryption" xreflabel="cfs_encryption">
    <term><varname>cfs_encryption</varname> (<type>boolean</type>)
     <indexterm>
      <primary><varname>cfs_encryption</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      Enables encryption of compressed pages. Switched off by default.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry id="cfs-gc-workers" xreflabel="cfs_gc_workers">
    <term><varname>cfs_gc_workers</varname> (<type>integer</type>)
     <indexterm>
      <primary><varname>cfs_gc_workers</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      Number of CFS background garbage collection workers (default: 1).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry id="cfs-gc-threshold" xreflabel="cfs_gc_threshold">
    <term><varname>cfs_gc_threshold</varname> (<type>integer</type>)
     <indexterm>
      <primary><varname>cfs_gc_threshold</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      Percentage of garbage in a file after which the file should be compactified (default: 50%).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry id="cfs-gc-period" xreflabel="cfs_gc_period">
    <term><varname>cfs_gc_period</varname> (<type>integer</type>)
     <indexterm>
      <primary><varname>cfs_gc_period</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      Interval in milliseconds between CFS garbage collection iterations (default: 5000, i.e. 5 seconds).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry id="cfs-gc-delay" xreflabel="cfs_gc_delay">
    <term><varname>cfs_gc_delay</varname> (<type>integer</type>)
     <indexterm>
      <primary><varname>cfs_gc_delay</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      Delay in milliseconds between the defragmentation of individual files (default: 0).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry id="cfs-level" xreflabel="cfs_level">
    <term><varname>cfs_level</varname> (<type>integer</type>)
     <indexterm>
      <primary><varname>cfs_level</> configuration parameter</primary>
     </indexterm>
    </term>
    <listitem>
     <para>
      CFS compression level (default: 1). 0 means no compression, 1 is the fastest compression.
      The maximal compression level depends on the particular compression algorithm: 9 for zlib, 19 for zstd...
     </para>
    </listitem>
   </varlistentry>

  </variablelist>

  <para>
   By default CFS is configured with one background worker performing garbage collection.
   The garbage collector traverses the tablespace directory, locating the map files in it and checking the percentage of garbage in each file.
   When it exceeds the <varname>cfs_gc_threshold</> threshold, the file is defragmented.
   The file is locked for the period of defragmentation, preventing any access to this part of the relation.
   When defragmentation is completed, the garbage collector waits <varname>cfs_gc_delay</varname> milliseconds and continues the directory traversal.
   After the traversal is finished, GC waits <varname>cfs_gc_period</varname> milliseconds and starts a new GC iteration.
   If there is more than one GC worker, they split the work based on the hash of the file inode.
  </para>
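
  <para>
   For example, GC behavior could be tuned as follows (a sketch: it assumes these parameters can be changed
   with ALTER SYSTEM; changing the number of background workers may require a server restart to take effect):
  </para>

<programlisting>
postgres=# alter system set cfs_gc_workers = 2;
postgres=# alter system set cfs_gc_threshold = 40;
postgres=# alter system set cfs_gc_period = 10000;
</programlisting>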

  <para>
   It is also possible to initiate GC manually using the <varname>cfs_start_gc(n_workers)</varname> function.
   This function returns the number of workers which have actually been started. Please notice that if the <varname>cfs_gc_workers</varname>
   parameter is non-zero, then GC is performed in the background, and the <varname>cfs_start_gc</varname> function does nothing and returns 0.
  </para>
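
  <para>
   For example, with background GC disabled (<varname>cfs_gc_workers</varname> set to 0), four GC workers
   could be started manually (the output shown is illustrative):
  </para>

<programlisting>
postgres=# select cfs_start_gc(4);
 cfs_start_gc
--------------
            4
</programlisting>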

  <para>
   It is possible to estimate the effect of table compression using the <varname>cfs_estimate(relation)</varname> function.
   This function takes the first ten blocks of the relation, tries to compress them, and returns the average compression ratio.
   So if the returned value is 7.8, the compressed table will occupy about eight times less space than the original table.
  </para>
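
  <para>
   For example (the relation name and the returned value are purely illustrative):
  </para>

<programlisting>
postgres=# select cfs_estimate('my_table');
 cfs_estimate
--------------
          7.8
</programlisting>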

 </sect1>
</chapter>