First, some terminology:
'shardlord' or 'lord' is the postgres instance and the background worker (bgw)
 spinning on it which manages sharding. In some places it is still called
 'shardmaster' or 'master'.
'worker nodes' or 'workers' are the other nodes, which hold the data.
'sharded table' is a table managed by shardman.
'shard' or 'partition' is any table containing a part of a sharded table.
'primary' is the main partition of a sharded table, i.e. the only writable
 partition.
'replica' is a secondary partition of a sharded table, i.e. a read-only
 partition.
'cluster' is either the whole system of shardlord and workers, or a cluster in
 the PostgreSQL sense; which one applies should be clear from the context.

For quick setup, see the scripts in the bin/ directory. The setup is configured
in the file common.sh. shardman_init.sh performs initdb for the shardlord &
workers, deploys example configs and creates the extension; shardman_start.sh
reinstalls the extension, which is useful for development.

Both the shardlord and the workers require the extension to be built and
installed. We depend on the pg_pathman extension, so it must be installed too.
The PostgreSQL location for building is derived from pg_config; you can also
specify the path to it in the PG_CONFIG var. The whole process of building and
copying the files to the PG server is just:

git clone
cd pg_shardman
make install

To actually install the extension, add pg_shardman and pg_pathman to
shared_preload_libraries, restart the server and run

create extension pg_shardman cascade;

Have a look at the postgresql.conf.common.template and
postgresql.conf.lord.template example configuration files. The former contains
all of shardman's GUCs and the important PostgreSQL GUCs for both shardlord and
workers; the latter is for the shardlord only -- in particular, shardman.master
defines whether the instance is a shardlord or not.

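For illustration, the relevant fragment of a shardlord's postgresql.conf might
look like this (a sketch only; take the actual settings from the template
files, and note that the boolean value for shardman.master is an assumption
here):

shared_preload_libraries = 'pg_pathman, pg_shardman'
shardman.master = on        # this instance is the shardlord
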
Immediately after starting the server with the shardman library preloaded, but
before creating the extension, you will get on the shardlord a warning like

WARNING: pg_shardman library is preloaded on shardlord, but extension is not
         created

This is normal, as we have a kind of circular dependency here: it is pointless
to create the extension without the library, but the library also uses SQL
objects, so the shardlord won't start without the extension installed.

Currently the extension schema is fixed; it is, who would have thought,
'shardman'.

Now you can issue commands to the shardlord. All shardman commands (cmds) you
issue return immediately, because technically they just submit the cmd to the
shardlord, which learns about them and starts the actual execution. At any time
you can cancel the currently executing command by sending SIGUSR1 to the
shardlord. This is not yet implemented as a handy SQL function, but you can use
the cancel_cmd.sh script from the bin/ directory. Every submitted cmd returns a
unique command id, which is used to check the cmd status later by querying the
shardman.cmd_log and shardman.cmd_opts tables:

CREATE TABLE cmd_log (
    id bigserial PRIMARY KEY,
    cmd_type cmd NOT NULL,
    status cmd_status DEFAULT 'waiting' NOT NULL
);
CREATE TABLE cmd_opts (
    id bigserial PRIMARY KEY,
    cmd_id bigint REFERENCES cmd_log(id),
    opt text
);

We will unite them into a convenient view someday. Command status is an enum
with mostly obvious values ('waiting', 'canceled', 'failed', 'in progress',
'success', 'done'). You might wonder what the difference between 'success' and
'done' is. We set the latter when the command is not atomic itself, but
consists of several atomic steps, some of which may have executed successfully
while some failed.

Currently cmd_log can be seen and commands issued only on the shardlord, but
that's easy to change.

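For example, to check the status and saved options of a submitted command
(assuming it returned id 1):

select c.status, c.cmd_type, o.opt
  from shardman.cmd_log c left join shardman.cmd_opts o on o.cmd_id = c.id
  where c.id = 1;
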
Let's get to the actual commands.

add_node(connstring text)
Add a node with the given connstring to the cluster. The node is assigned a
unique id. If the node previously contained shardman state from an old cluster
(not the one managed by the current shardlord), this state will be lost.

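For instance, on the shardlord (the connstring here is a made-up example):

select shardman.add_node('host=worker1 port=5432 dbname=postgres');
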
rm_node(node_id int)
Remove the node from the cluster. Its shardman state will be reset. We don't
delete tables with data and foreign tables, though.

You can see all cluster nodes at any time by examining the shardman.nodes
table:
-- active is the normal state; the others are needed only for proper node
-- addition and removal
CREATE TYPE worker_node_status AS ENUM (
    'active', 'add_in_progress', 'rm_in_progress', 'removed');
CREATE TABLE nodes (
    id serial PRIMARY KEY,
    connstring text NOT NULL UNIQUE,
    worker_status worker_node_status,
    -- While currently we don't support master and worker roles on one node,
    -- potentially a node can be either worker, master or both, so we need
    -- 2 bits. One bool with NULL might be fine, but it seems a bit
    -- counter-intuitive.
    worker bool NOT NULL DEFAULT true,
    master bool NOT NULL DEFAULT false,
    -- cmd by which the node was added
    added_by bigint REFERENCES shardman.cmd_log(id)
);

create_hash_partitions(
    node_id int, relation text, expr text, partitions_count int,
    rebalance bool DEFAULT true)
Hash-shard table 'relation' lying on node 'node_id' by key 'expr', creating
'partitions_count' shards. As you have probably noticed, the signature mirrors
the pathman function of the same name. If 'rebalance' is false, we just
partition the table locally, making the other nodes aware of it. If it is true,
we also immediately run the 'rebalance' function on the table to distribute the
partitions, see below.

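A hypothetical example: shard table 'pt', which lies on node 1, by column 'id'
into 10 partitions and spread them around:

select shardman.create_hash_partitions(1, 'pt', 'id', 10);
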
There are two tables describing the state of sharded tables (no pun intended),
shardman.tables and shardman.partitions:
CREATE TABLE tables (
    relation text PRIMARY KEY, -- table name
    expr text NOT NULL,
    partitions_count int NOT NULL,
    create_sql text NOT NULL, -- sql to create the table
    -- Node on which the table was partitioned at the beginning. Used only
    -- during initial tables inflation to distinguish between the table owner
    -- and other nodes; probably cleaner to keep it in a separate table.
    initial_node int NOT NULL REFERENCES nodes(id)
);
-- A primary shard and its replicas compose a doubly-linked list: nxt refers to
-- the node containing the next replica, prv to the node with the previous
-- replica (or the primary, if we are the first replica). If prv is NULL, this
-- is the primary. We don't number the parts separately, since we are never
-- going to allow several copies of the same partition on one node.
CREATE TABLE partitions (
    part_name text,
    owner int NOT NULL REFERENCES nodes(id), -- node the partition lies on
    prv int REFERENCES nodes(id),
    nxt int REFERENCES nodes(id),
    relation text NOT NULL REFERENCES tables(relation),
    PRIMARY KEY (part_name, owner)
);

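For instance, to see where a shard and its replicas live ('pt_0' here is a
hypothetical pathman-style partition name):

select part_name, owner, prv, nxt
  from shardman.partitions where part_name = 'pt_0';
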
move_part(part_name text, dest int, src int DEFAULT NULL)
Move shard 'part_name' from node 'src' to node 'dest'. If 'src' is NULL, the
primary shard is moved. The cmd fails if there is already a replica of this
shard on 'dest'.

create_replica(part_name text, dest int)
Create a replica of shard 'part_name' on node 'dest'. The cmd fails if there is
already a replica of this shard on 'dest'.

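Continuing the hypothetical example: move the primary of shard 'pt_0' to node 2
and create its replica on node 3:

select shardman.move_part('pt_0', 2);
select shardman.create_replica('pt_0', 3);
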
rebalance(relation text)
Evenly distribute the partitions of table 'relation' across all nodes.
Currently this is a pretty dumb function: it just tries to move each shard once
to a node chosen in round-robin manner, completely ignoring the current
distribution. Since the dest node can already have a replica of a partition, it
is not uncommon to see warnings about failed moves during execution. After
completion the cmd status is 'done', not 'success'.

set_replevel(relation text, replevel int)
Add replicas to the shards of sharded table 'relation' until each one has
'replevel' replicas. Replica deletion is not implemented yet. Note that it is
pointless to set 'replevel' to more than the number of active workers minus
one, since we forbid several replicas of the same shard on one node. Nodes for
replicas are chosen randomly. As in 'rebalance', we are fully oblivious of the
current shard distribution, so you will see a bunch of warnings about failed
replica creation -- one for each time the random choice hits a node that
already has a replica of the shard.
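
A sketch for the running 'pt' example: even out the shard distribution, then
ensure one replica per shard:

select shardman.rebalance('pt');
select shardman.set_replevel('pt', 1);
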
Dropping of sharded tables, as well as replica deletion, is not implemented
yet.

Limitations:
* We can't switch the shardlord for now.
* The shardlord itself can't be a worker node for now.
* ALTER TABLE for sharded tables is not supported.