
Commit 91c3e49

committed
CFS branch for PGPROEE9_6
1 parent 89ace65 commit 91c3e49

File tree

3 files changed: +1309 -0 lines changed


doc/src/sgml/cfs.sgml

Lines changed: 300 additions & 0 deletions
@@ -0,0 +1,300 @@
<!-- doc/src/sgml/cfs.sgml -->

<chapter id="cfs">
<title>Compressed file system</title>

<para>
This chapter explains page-level compression and encryption in the
<productname>PostgreSQL</> database system.
</para>

<sect1 id="cfs-overview">
12+
<title>Why database compression/encryption may be useful</title>
13+
14+
<para>
15+
Databases are used to store larger number of text and duplicated information. This is why compression of most of databases
16+
can be quite efficient and reduce used storage size 3..5 times. Postgres performs compression of TOAST data, but small
17+
text fields which fits in the page are not compressed. Also not only heap pages can be compressed, indexes on text keys
18+
or indexes with larger number of duplicate values are also good candidates for compression.
19+
</para>
20+
21+
<para>
Postgres works with disk data through a buffer pool which accumulates the most frequently used buffers.
The interface between the buffer manager and the file system is the most natural place for performing compression.
Buffers are stored on disk in compressed form, reducing disk usage and minimizing the amount of data to be read,
while the in-memory buffer pool contains uncompressed buffers, providing access to the records at the same speed as without
compression. Since modern servers have large amounts of RAM, a substantial part of the database can be cached in
memory and accessed without any compression overhead penalty.
</para>

<para>
Besides the obvious advantage of saving disk space, compression can also improve system performance.
There are two main reasons for this:
</para>

<variablelist>
<varlistentry>
<term>Reducing the amount of disk IO</term>
<listitem>
<para>
Compression helps to reduce the size of data which has to be written to the disk or read from it.
A compression ratio of 3 means that you need to read 3 times less data, or that the same number of records can be fetched
3 times faster.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>Improving locality</term>
<listitem>
<para>
When modified buffers are flushed from the buffer pool to the disk, they are written to random locations
on the disk. The Postgres cache replacement algorithm decides which buffer to evict from the pool
based on its access frequency, ignoring its location on the disk. So two subsequently written buffers can be
located in completely different parts of the disk. For an HDD the seek time is quite large (about 10 msec), which corresponds
to 100 random writes per second. The speed of sequential writes can be about 100MB/sec, which corresponds to
roughly 10000 buffers per second (about 100 times faster). For an SSD the gap between sequential and random write speed is smaller,
but sequential writes are still more efficient. How does this relate to data compression?
The size of a buffer in PostgreSQL is fixed (8kB by default). The size of a compressed buffer depends on the content of the buffer,
so an updated buffer cannot always fit in its old location on the disk. This is why we cannot access pages directly
by their address. Instead we have to use a map which translates the logical address of a page to its physical location
on the disk. Certainly this extra level of indirection adds overhead, but in most cases this map fits in memory,
so a page lookup is nothing more than accessing an array element. The presence of this map also has a positive effect:
we can now write updated pages sequentially, just updating their map entries.
Postgres does a lot to avoid a "write storm", the intensive flushing of data to the disk when buffer pool space is
exhausted. Compression significantly reduces this disk load.
</para>
</listitem>
</varlistentry>
</variablelist>

<para>
Another useful feature which can be combined with compression is database encryption.
Encryption protects your database from unintended access: if somebody steals your notebook or hard drive, or makes a copy
of it, the thief will not be able to extract information from your database if it is encrypted.
Postgres provides the contrib module pgcrypto, allowing you to encrypt particular types/columns,
but a safer and more convenient way is to encrypt all data in the database. Encryption can be combined with compression:
data is stored on disk in encrypted form and decrypted when a page is loaded into the buffer pool.
It is essential that compression is performed before encryption, otherwise encryption eliminates regularities in the
data and the compression ratio will be close to 1.
</para>

<para>
Why do we need to perform compression/encryption in Postgres rather than using the corresponding features of the underlying file
system? The first answer is that there are not many file systems supporting compression and encryption on all OSes.
And even if such a file system is available, it is not always possible or convenient to install it just
to compress/protect your database. The second answer is that performing compression at the database level can be more efficient,
because the database knows the size of its pages and can use this knowledge to compress them more efficiently.
</para>

</sect1>

<sect1 id="cfs-implementation">
93+
<title>How compression/encryption are integrated in Postgres</title>
94+
95+
<para>
96+
To improve efficiency of disk IO, Postgres is working with files through buffer manager, which pins in memory
97+
most frequently used pages. Each page is fixed size (8kb by default). But if we compress page, then
98+
its size will depend on its content. So updated page can require more (or less) space than original page.
99+
So we may not always perform in-place update of the page. Instead of it we have to locate new space for the page and somehow release
100+
old space. There are two main apporaches of solving this problem:
101+
</para>
102+
103+
<variablelist>
<varlistentry>
<term>Memory allocator</term>
<listitem>
<para>
We could implement our own allocator of file space. Usually, to reduce fragmentation, a fixed-size block allocator is used.
It means that space is allocated in some fixed quantum. For example, if the compressed page size is 932 bytes, then a
1024-byte block is allocated for it in the file.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>Garbage collector</term>
<listitem>
<para>
We can always allocate space for the pages sequentially at the end of the file and periodically perform
compactification (defragmentation) of the file, moving all used pages to the beginning of the file.
Such a garbage collection process can be performed in background.
As explained in the previous section, writing the flushed pages sequentially can significantly
increase IO speed and thus improve performance. This is why this approach is used in CFS.
</para>
</listitem>
</varlistentry>
</variablelist>

<para>
Since the page location is not fixed and a page can be moved, we can no longer access a page directly by its address and need
an extra level of indirection to map the logical address of the page to its physical location on the disk.
This is done using memory-mapped files. In most cases this map will be kept in memory (the size of the map is about 1000 times
smaller than the size of the data file) and address translation adds almost no overhead to page access time.
But we need to maintain these extra files: flush them during checkpoint, remove them when the table is dropped, include them in backups and
so on.
</para>

<para>
Postgres stores a relation in a set of files, the size of each file not exceeding 2GB. A separate page map is constructed for each file.
Garbage collection in CFS is done by several background workers. The number of these workers and the pauses in their work can be
configured by the database administrator. The workers split the work based on the inode hash, so they do not conflict with each other.
Each file is processed separately. A file is blocked for access at the time of garbage collection, but the complete relation is not
blocked. To ensure data consistency, the GC creates copies of the original data and map files. Once they are flushed to the disk, the
new version of the data file is atomically renamed to the original file name. Then the new page map data is copied to the memory-mapped file
and the backup file for the page map is removed. During recovery after a crash we first check whether there is a backup of the data file.
If such a file exists, then the original file has not yet been updated and we can safely remove the backup files. If such a file doesn't exist,
then we check for the presence of a map file backup. If it is present, then defragmentation of this file was not completed
because of the crash and we complete this operation by copying the map from the backup file.
</para>

<para>
CFS can be built with one of several compression libraries: Postgres lz, zlib, lz4, snappy, lzfse...
But this is a build-time choice: it is currently not possible to dynamically choose the compression algorithm.
CFS stores information about the used compression algorithm in the tablespace and produces an error if Postgres is built with a different
library.
</para>

<para>
Encryption is performed using the RC4 algorithm. The cipher key is obtained from the <varname>PG_CIPHER_KEY</varname> environment variable.
Please notice that catalog relations as well as non-main forks of a relation are not encrypted.
</para>

</sect1>

<sect1 id="cfs-usage">
166+
<title>Using of compression/encryption</title>
167+
168+
<para>
169+
Compression can be enabled for particular tablespaces. System relations are not compressed in any case.
170+
It is not currently possible to alter tablespace compression option, i.e. it is not possible to compress existed tablespace
171+
or visa versa - decompress compressed tablespace.
172+
</para>
173+
174+
<para>
So to use compression/encryption you need to create a tablespace with the <varname>compression=true</varname> option.
You can make this tablespace the default tablespace; in this case all new tables will be implicitly created in it:
</para>

<programlisting>
postgres=# create tablespace zfs location '/var/data/cfs' with (compression=true);
postgres=# set default_tablespace=zfs;
</programlisting>

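<para>
Alternatively, instead of changing the default tablespace, a table can be placed in the compressed tablespace explicitly
with the standard <literal>TABLESPACE</literal> clause. The table and column names below are only an illustration:
</para>

<programlisting>
postgres=# create table messages(id bigserial primary key, body text) tablespace zfs;
postgres=# insert into messages(body) select md5(i::text) from generate_series(1,100000) i;
</programlisting>
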
<para>
Encryption can currently only be used together with compression: it is not possible to use encryption without compression.
To enable encryption you should set the <varname>cfs_encryption</varname> parameter to true and provide the cipher key by setting the
<varname>PG_CIPHER_KEY</varname> environment variable.
</para>

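<para>
For example, encryption could be switched on as shown below. This is only a sketch: it assumes that
<varname>cfs_encryption</varname> can be changed with <command>ALTER SYSTEM</command> like an ordinary configuration
parameter and takes effect after a server restart, and that <varname>PG_CIPHER_KEY</varname> has already been set in the
environment of the server process before startup:
</para>

<programlisting>
postgres=# alter system set cfs_encryption = on;
</programlisting>
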
<para>
CFS provides the following configuration parameters:
</para>

<variablelist>

<varlistentry id="cfs-encryption" xreflabel="cfs_encryption">
<term><varname>cfs_encryption</varname> (<type>boolean</type>)
<indexterm>
<primary><varname>cfs_encryption</> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Enables encryption of compressed pages. Switched off by default.
</para>
</listitem>
</varlistentry>

<varlistentry id="cfs-gc-workers" xreflabel="cfs_gc_workers">
<term><varname>cfs_gc_workers</varname> (<type>integer</type>)
<indexterm>
<primary><varname>cfs_gc_workers</> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Number of CFS background garbage collection workers (default: 1).
</para>
</listitem>
</varlistentry>

<varlistentry id="cfs-gc-threshold" xreflabel="cfs_gc_threshold">
<term><varname>cfs_gc_threshold</varname> (<type>integer</type>)
<indexterm>
<primary><varname>cfs_gc_threshold</> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Percentage of garbage in a file after which the file is compactified (default: 50%).
</para>
</listitem>
</varlistentry>

<varlistentry id="cfs-gc-period" xreflabel="cfs_gc_period">
<term><varname>cfs_gc_period</varname> (<type>integer</type>)
<indexterm>
<primary><varname>cfs_gc_period</> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Interval in milliseconds between CFS garbage collection iterations (default: 5 seconds).
</para>
</listitem>
</varlistentry>

<varlistentry id="cfs-gc-delay" xreflabel="cfs_gc_delay">
<term><varname>cfs_gc_delay</varname> (<type>integer</type>)
<indexterm>
<primary><varname>cfs_gc_delay</> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Delay in milliseconds between defragmentation of individual files (default: 0).
</para>
</listitem>
</varlistentry>

<varlistentry id="cfs-level" xreflabel="cfs_level">
<term><varname>cfs_level</varname> (<type>integer</type>)
<indexterm>
<primary><varname>cfs_level</> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
CFS compression level (default: 1). 0 means no compression, 1 is the fastest compression.
The maximal compression level depends on the particular compression algorithm: 9 for zlib, 19 for zstd...
</para>
</listitem>
</varlistentry>

</variablelist>

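<para>
For example, garbage collection behaviour could be tuned as follows. This is only a sketch, assuming these parameters can be
changed with <command>ALTER SYSTEM</command> like ordinary configuration parameters (changing
<varname>cfs_gc_workers</varname> may require a server restart):
</para>

<programlisting>
postgres=# alter system set cfs_gc_workers = 2;     -- two background GC workers
postgres=# alter system set cfs_gc_threshold = 40;  -- defragment files with more than 40% of garbage
postgres=# alter system set cfs_gc_period = 10000;  -- wait 10 seconds between GC iterations
postgres=# select pg_reload_conf();
</programlisting>
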
<para>
By default CFS is configured with one background worker performing garbage collection.
The garbage collector traverses the tablespace directory, locating the map files in it and checking the percentage of garbage in each file.
When this percentage exceeds the <varname>cfs_gc_threshold</> threshold, the file is defragmented.
The file is locked for the duration of the defragmentation, preventing any access to this part of the relation.
When defragmentation is completed, garbage collection waits <varname>cfs_gc_delay</varname> milliseconds and continues the directory traversal.
After the end of the traversal, the GC waits <varname>cfs_gc_period</varname> milliseconds and starts a new GC iteration.
If there is more than one GC worker, they split the work based on the hash of the file inode.
</para>

<para>
It is also possible to initiate GC manually using the <varname>cfs_start_gc(n_workers)</varname> function.
This function returns the number of workers which were actually started. Please notice that if the <varname>cfs_gc_workers</varname>
parameter is non-zero, then GC is performed in background and the <varname>cfs_start_gc</varname> function does nothing and returns 0.
</para>

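<para>
For example, with background GC disabled (<varname>cfs_gc_workers</varname> set to 0), a manual GC pass with two workers
can be started like this:
</para>

<programlisting>
postgres=# select cfs_start_gc(2);
</programlisting>
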
<para>
It is possible to estimate the effect of table compression using the <varname>cfs_estimate(relation)</varname> function.
This function takes the first ten blocks of the relation, tries to compress them and returns the average compression ratio.
So if the returned value is 7.8, then the compressed table occupies about eight times less space than the original table.
</para>

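<para>
For example, the expected compression ratio could be checked like this (a sketch which assumes the relation can be passed
by name; <literal>messages</literal> is the illustrative table created above):
</para>

<programlisting>
postgres=# select cfs_estimate('messages');
</programlisting>
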
</sect1>
</chapter>
