Commit 671f3e1

2 parents 61c71a4 + ecac018 commit 671f3e1

Only part of this large commit is shown below.

69 files changed: +270 -9423 lines changed

.travis.yml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ os:
 compiler:
 - gcc
 - clang
-install: cpanm IPC::Run DBD::Pg
+install: cpanm IPC::Run DBD::Pg Proc::ProcessTable
 before_script: ./configure --enable-tap-tests && make -j4
 env:
 #- TESTDIR=.

README.md

Lines changed: 23 additions & 46 deletions
@@ -1,65 +1,42 @@
-# Postgres_cluster
+# postgres_cluster

 [![Build Status](https://travis-ci.org/postgrespro/postgres_cluster.svg?branch=master)](https://travis-ci.org/postgrespro/postgres_cluster)

 Various experiments with PostgreSQL clustering perfomed at PostgresPro.

-This is mirror of postgres repo with several changes to the core and few extra extensions.
+This is a mirror of postgres repo with several changes to the core and a few extra extensions.

 ## Core changes:

-* Transaction manager interface (eXtensible Transaction Manager, xtm). Generic interface to plug distributed transaction engines. More info at [[https://wiki.postgresql.org/wiki/DTM]] and [[http://www.postgresql.org/message-id/flat/F2766B97-555D-424F-B29F-E0CA0F6D1D74@postgrespro.ru]].
+* Transaction manager interface (eXtensible Transaction Manager, xtm). Generic interface to plug distributed transaction engines. More info on [postgres wiki](https://wiki.postgresql.org/wiki/DTM) and on [the email thread](http://www.postgresql.org/message-id/flat/F2766B97-555D-424F-B29F-E0CA0F6D1D74@postgrespro.ru).
 * Distributed deadlock detection API.
-* Logical decoding of two-phase transactions.
-
+* Logical decoding of transactions.

 ## New extensions:

-* pg_tsdtm. Coordinator-less transaction management by tracking commit timestamps.
-* multimaster. Synchronous multi-master replication based on logical_decoding and pg_dtm.
-
-
-## Changed extension:
-
-* postgres_fdw. Added support of pg_tsdtm.
-
-## Installing multimaster
-
-1. Build and install postgres from this repo on all machines in cluster.
-1. Install contrib/raftable and contrib/mmts extensions.
-1. Right now we need clean postgres installation to spin up multimaster cluster.
-1. Create required database inside postgres before enabling multimaster extension.
-1. We are requiring following postgres configuration:
-* 'max_prepared_transactions' > 0 -- in multimaster all writing transaction along with ddl are wrapped as two-phase transaction, so this number will limit maximum number of writing transactions in this cluster node.
-* 'synchronous_commit - off' -- right now we do not support async commit. (one can enable it, but that will not bring desired effect)
-* 'wal_level = logical' -- multimaster built on top of logical replication so this is mandatory.
-* 'max_wal_senders' -- this should be at least number of nodes - 1
-* 'max_replication_slots' -- this should be at least number of nodes - 1
-* 'max_worker_processes' -- at least 2*N + 1 + P, where N is number of nodes in cluster, P size of pool of workers(see below) (1 raftable, n-1 receiver, n-1 sender, mtm-sender, mtm-receiver, + number of pool worker).
-* 'default_transaction_isolation = 'repeatable read'' -- multimaster isn't supporting default read commited level.
-1. Multimaster have following configuration parameters:
-* 'multimaster.conn_strings' -- connstrings for all nodes in cluster, separated by comma.
-* 'multimaster.node_id' -- id of current node, number starting from one.
-* 'multimaster.workers' -- number of workers that can apply transactions from neighbouring nodes.
-* 'multimaster.use_raftable = true' -- just set this to true. Deprecated.
-* 'multimaster.queue_size = 52857600' -- queue size for applying transactions from neighbouring nodes.
-* 'multimaster.ignore_tables_without_pk = 1' -- do not replicate tables without primary key
-* 'multimaster.heartbeat_send_timeout = 250' -- heartbeat period (ms).
-* 'multimaster.heartbeat_recv_timeout = 1000' -- disconnect node if we miss heartbeats all that time (ms).
-* 'multimaster.twopc_min_timeout = 40000' -- rollback stalled transaction after this period (ms).
-* 'raftable.id' -- id of current node, number starting from one.
-* 'raftable.peers' -- id of current node, number starting from one.
-1. Allow replication in pg_hba.conf.
-
-## Multimaster status functions
+The following table describes the features and the way they are implemented in our four main extensions:

-* mtm.get_nodes_state() -- show status of nodes on cluster
-* mtm.get_cluster_state() -- show whole cluster status
-* mtm.get_cluster_info() -- print some debug info
-* mtm.make_table_local(relation regclass) -- stop replication for a given table
+| |commit timestamps |snapshot sharing |
+|---------------------------:|:----------------------------:|:----------------------------------:|
+|**distributed transactions**|[`pg_tsdtm`](contrib/pg_tsdtm)|[`pg_dtm`](contrib/pg_dtm) |
+|**multimaster replication** |[`mmts`](contrib/mmts) |[`multimaster`](contrib/multimaster)|

+### [`mmts`](contrib/mmts)
+An implementation of synchronous **multi-master replication** based on **commit timestamps**.

+### [`multimaster`](contrib/multimaster)
+An implementation of synchronous **multi-master replication** based on **snapshot sharing**.

+### [`pg_dtm`](contrib/pg_dtm)
+An implementation of **distributed transaction** management based on **snapshot sharing**.

+### [`pg_tsdtm`](contrib/pg_tsdtm)
+An implementation of **distributed transaction** management based on **commit timestamps**.

+### [`arbiter`](contrib/arbiter)
+A distributed transaction management daemon.
+Used by `pg_dtm` and `multimaster`.

+### [`raftable`](contrib/raftable)
+A key-value table replicated over Raft protocol.
+Used by `mmts`.

contrib/mmts/Cluster.pm

Lines changed: 65 additions & 9 deletions
@@ -3,6 +3,7 @@ package Cluster;
 use strict;
 use warnings;

+use Proc::ProcessTable;
 use PostgresNode;
 use TestLib;
 use Test::More;
@@ -103,6 +104,7 @@ sub configure
 multimaster.use_raftable = true
 multimaster.heartbeat_recv_timeout = 1000
 multimaster.heartbeat_send_timeout = 250
+multimaster.max_nodes = 3
 multimaster.ignore_tables_without_pk = true
 multimaster.twopc_min_timeout = 2000
 ));
@@ -129,13 +131,21 @@ sub start
 sub stopnode
 {
     my ($node, $mode) = @_;
-    my $port = $node->port;
-    my $pgdata = $node->data_dir;
-    my $name = $node->name;
+    return 1 unless defined $node->{_pid};
     $mode = 'fast' unless defined $mode;
-    diag("stopping node $name ${mode}ly at $pgdata port $port");
-    next unless defined $node->{_pid};
+    my $name = $node->name;
+    diag("stopping $name ${mode}ly");
+
+    if ($mode eq 'kill') {
+        killtree($node->{_pid});
+        return 1;
+    }
+
+    my $pgdata = $node->data_dir;
     my $ret = TestLib::system_log('pg_ctl', '-D', $pgdata, '-m', 'fast', 'stop');
+    my $pidfile = $node->data_dir . "/postmaster.pid";
+    diag("unlink $pidfile");
+    unlink $pidfile;
     $node->{_pid} = undef;
     $node->_update_pid;

@@ -147,6 +157,51 @@ sub stopnode
     return 1;
 }

+sub stopid
+{
+    my ($self, $idx, $mode) = @_;
+    return stopnode($self->{nodes}->[$idx]);
+}
+
+sub killtree
+{
+    my $root = shift;
+    diag("killtree $root\n");
+
+    my $t = new Proc::ProcessTable;
+
+    my %parent = ();
+    #my %cmd = ();
+    foreach my $p (@{$t->table}) {
+        $parent{$p->pid} = $p->ppid;
+        # $cmd{$p->pid} = $p->cmndline;
+    }
+
+    if (!defined $root) {
+        return;
+    }
+    my @queue = ($root);
+    my @killist = ();
+
+    while (scalar @queue) {
+        my $victim = shift @queue;
+        while (my ($pid, $ppid) = each %parent) {
+            if ($ppid == $victim) {
+                push @queue, $pid;
+            }
+        }
+        diag("SIGSTOP to $victim");
+        kill 'STOP', $victim;
+        unshift @killist, $victim;
+    }
+
+    diag("SIGKILL to " . join(' ', @killist));
+    kill 'KILL', @killist;
+    #foreach my $victim (@killist) {
+    # print("kill $victim " . $cmd{$victim} . "\n");
+    #}
+}
+
 sub stop
 {
     my ($self, $mode) = @_;
@@ -155,12 +210,13 @@ sub stop

     my $ok = 1;
     diag("stopping cluster ${mode}ly");
-    foreach my $node (@$nodes)
-    {
+
+    foreach my $node (@$nodes) {
         if (!stopnode($node, $mode)) {
             $ok = 0;
-            if (!stopnode($node, 'immediate')) {
-                BAIL_OUT("failed to stop $node immediately");
+            if (!stopnode($node, 'kill')) {
+                my $name = $node->name;
+                BAIL_OUT("failed to kill $name");
             }
         }
     }
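
The tree walk in `killtree` can be illustrated outside Perl. The following is a minimal C sketch of the same idea, not the code from this commit: it takes a process-table snapshot as an array instead of reading it through `Proc::ProcessTable`, and it only computes the kill order rather than sending signals. The names `proc_entry` and `collect_tree` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* One row of a process-table snapshot: a process and its parent. */
struct proc_entry {
    pid_t pid;
    pid_t ppid;
};

/*
 * Breadth-first walk of the process tree rooted at `root`.
 * Writes the visited pids to `out` in reverse visiting order
 * (descendants before ancestors), like `unshift @killist` in the
 * Perl code, and returns how many pids were written.
 * `out` must have room for `cap` entries; the walk stops adding
 * pids once `out` is full.
 */
size_t collect_tree(const struct proc_entry *table, size_t n,
                    pid_t root, pid_t *out, size_t cap)
{
    size_t head = 0, tail = 0;

    assert(table != NULL || n == 0);
    assert(out != NULL || cap == 0);
    if (cap == 0)
        return 0;

    /* Use `out` as the BFS queue: [head, tail) are pending victims. */
    out[tail++] = root;
    while (head < tail) {
        pid_t victim = out[head++];
        for (size_t i = 0; i < n && tail < cap; i++) {
            /* Skip self-parented rows so the walk cannot loop on them. */
            if (table[i].ppid == victim && table[i].pid != victim)
                out[tail++] = table[i].pid;
        }
    }

    /* Reverse in place so the most recently found pids come first. */
    for (size_t i = 0, j = tail; i + 1 < j; i++, j--) {
        pid_t tmp = out[i];
        out[i] = out[j - 1];
        out[j - 1] = tmp;
    }
    return tail;
}
```

The Perl version additionally sends `SIGSTOP` to each process as it is visited and `SIGKILL` to the whole list at the end. In this sketch that step is left to the caller, for example with POSIX `kill(2)`.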

contrib/mmts/README.md

Lines changed: 52 additions & 50 deletions
@@ -1,53 +1,55 @@
-# Postgres Multimaster
+# `mmts`
+
+An implementation of synchronous **multi-master replication** based on **commit timestamps**.
+
+## Usage
+
+1. Install `contrib/raftable` and `contrib/mmts` on each instance.
+1. Add these required options to the `postgresql.conf` of each instance in the cluster.
+
+```sh
+max_prepared_transactions = 200 # should be > 0, because all
+# transactions are implicitly two-phase
+max_connections = 200
+max_worker_processes = 100 # at least (2 * n + p + 1)
+# this figure is calculated as:
+# 1 raftable worker
+# n-1 receiver
+# n-1 sender
+# 1 mtm-sender
+# 1 mtm-receiver
+# p workers in the pool
+max_parallel_degree = 0
+wal_level = logical # multimaster is build on top of
+# logical replication and will not work otherwise
+max_wal_senders = 10 # at least the number of nodes
+wal_sender_timeout = 0
+default_transaction_isolation = 'repeatable read'
+max_replication_slots = 10 # at least the number of nodes
+shared_preload_libraries = 'raftable,multimaster'
+multimaster.workers = 10
+multimaster.queue_size = 10485760 # 10mb
+multimaster.node_id = 1 # the 1-based index of the node in the cluster
+multimaster.conn_strings = 'dbname=... host=....0.0.1 port=... raftport=..., ...'
+# comma-separated list of connection strings
+multimaster.use_raftable = true
+multimaster.heartbeat_recv_timeout = 1000
+multimaster.heartbeat_send_timeout = 250
+multimaster.ignore_tables_without_pk = true
+multimaster.twopc_min_timeout = 2000
+```
+1. Allow replication in `pg_hba.conf`.
+
+## Status functions
+
+`create extension mmts;` to gain access to these functions:
+
+* `mtm.get_nodes_state()` -- show status of nodes on cluster
+* `mtm.get_cluster_state()` -- show whole cluster status
+* `mtm.get_cluster_info()` -- print some debug info
+* `mtm.make_table_local(relation regclass)` -- stop replication for a given table

 ## Testing

-The testing process involves multiple modules that perform different tasks. The
-modules and their APIs are listed below.
-
-### Modules
-
-#### `combineaux`
-
-Governs the whole testing process. Runs different workloads during different
-troubles.
-
-#### `stresseaux`
-
-Puts workloads against the database. Writes logs that are later used by
-`valideaux`.
-
-* `start(id, workload, cluster)` - starts a `workload` against the `cluster`
-and call it `id`.
-* `stop(id)` - stops a previously started workload called `id`.
-
-#### `starteaux`
-
-Manages the database nodes.
-
-* `deploy(driver, ...)` - deploys a cluster using the specified `driver` and
-other parameters specific to that driver. Returns a `cluster` instance that is
-used in other methods.
-* `cluster->up(id)` - adds a node named `id` to the `cluster`.
-* `cluster->down(id)` - removes a node named `id` from the `cluster`.
-* `cluster->drop(src, dst, ratio)` - drop `ratio` packets flowing from node
-`src` to node `dst`.
-* `cluster->delay(src, dst, msec)` - delay packets flowing from node `src` to
-node `dst` by `msec` milliseconds.
-
-#### `troubleaux`
-
-This is the troublemaker that messes with the network, nodes and time.
-
-* `cause(cluster, trouble, ...)` - causes the specified `trouble` in the
-specified `cluster` with some trouble-specific parameters.
-* `fix(cluster)` - fixes all troubles caused in the `cluster`.
-
-#### `valideaux`
-
-Validates the logs of stresseaux.
-
-#### `reporteaux`
-
-Generates reports on the test results. This is usually a table that with
-`trouble` vs `workload` axes.
+* `make -C contrib/mmts check` to run TAP-tests.
+* `make -C contrib/mmts xcheck` to run blockade tests. The blockade tests require `docker`, `blockade`, and some other packages installed, see [requirements.txt](tests2/requirements.txt) for the list. You might also want to gain superuser privileges to run these tests successfully.
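
The `max_worker_processes` lower bound from the configuration comments above, `2 * n + p + 1`, can be checked by adding up the listed workers. The sketch below does that in C. The function name `mtm_min_worker_processes` is hypothetical and not part of `mmts`, and it assumes `p` is the pool size set by `multimaster.workers`.

```c
#include <assert.h>

/*
 * Lower bound for max_worker_processes, following the breakdown in the
 * README comments: 1 raftable worker, n-1 receivers, n-1 senders,
 * 1 mtm-sender, 1 mtm-receiver and p workers in the pool.
 * Hypothetical helper for illustration only.
 */
int mtm_min_worker_processes(int n, int p)
{
    assert(n >= 1 && p >= 0);

    int raftable  = 1;
    int receivers = n - 1;
    int senders   = n - 1;
    int mtm       = 2;   /* mtm-sender + mtm-receiver */

    /* 1 + 2 * (n - 1) + 2 + p == 2 * n + p + 1 */
    return raftable + receivers + senders + mtm + p;
}
```

For example, three nodes with a pool of 10 workers need at least 17 worker processes, so the sample value of 100 leaves ample headroom.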

contrib/mmts/arbiter.c

Lines changed: 2 additions & 2 deletions
@@ -340,8 +340,8 @@ static void MtmScheduleHeartbeat()
     if (!stop) {
         enable_timeout_after(heartbeat_timer, MtmHeartbeatSendTimeout);
         send_heartbeat = true;
-        PGSemaphoreUnlock(&Mtm->votingSemaphore);
     }
+    PGSemaphoreUnlock(&Mtm->votingSemaphore);
 }

 static void MtmSendHeartbeat()
@@ -377,7 +377,7 @@ static void MtmSendHeartbeat()

 void MtmCheckHeartbeat()
 {
-    if (send_heartbeat) {
+    if (send_heartbeat && !stop) {
         send_heartbeat = false;
         enable_timeout_after(heartbeat_timer, MtmHeartbeatSendTimeout);
         MtmSendHeartbeat();
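
One way to read the two hunks above: after the change the semaphore is unlocked on every timer tick, including while stopping, and a heartbeat that is already pending is no longer sent once `stop` is set. The following C model is a sketch of that reading only. A plain counter stands in for `Mtm->votingSemaphore`, timer re-arming is omitted, and the names `schedule_heartbeat`, `check_heartbeat`, `wakeups` and `heartbeats_sent` are illustrative, not taken from `arbiter.c`.

```c
#include <stdbool.h>

/* Simplified stand-ins for the arbiter state; names are illustrative. */
static bool stop = false;            /* set when the arbiter is shutting down */
static bool send_heartbeat = false;  /* set by the timer, cleared by the sender */
static int  wakeups = 0;             /* models unlocks of the voting semaphore */
static int  heartbeats_sent = 0;     /* models calls to MtmSendHeartbeat() */

/* Timer callback, shaped like the patched MtmScheduleHeartbeat(). */
void schedule_heartbeat(void)
{
    if (!stop)
        send_heartbeat = true;   /* re-arming the timer is omitted here */
    wakeups++;                   /* unlock unconditionally, even when stopping */
}

/* Shaped like the patched MtmCheckHeartbeat(). */
void check_heartbeat(void)
{
    if (send_heartbeat && !stop) {
        send_heartbeat = false;
        heartbeats_sent++;
    }
}
```

In this model a tick that arrives after `stop` is set still increments `wakeups`, while `check_heartbeat()` leaves `heartbeats_sent` unchanged.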
