PostgreSQL Replication
Second Edition
Hans-Jürgen Schönig
BIRMINGHAM - MUMBAI
PostgreSQL Replication
Second Edition
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book
is sold without warranty, either express or implied. Neither the author, nor Packt
Publishing and its dealers and distributors, will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78355-060-9
www.packtpub.com
Credits
Reviewers
Swathi Kurunji
Jeff Lawson
Maurício Linhares
Shaun M. Thomas
Tomas Vondra

Commissioning Editor
Kartikey Pandey

Acquisition Editor
Larissa Pinto

Content Development Editor
Nikhil Potdukhe

Technical Editor
Manali Gonsalves

Copy Editors
Dipti Mankame
Vikrant Phadke

Proofreader
Safis Editing

Indexer
Priya Sane

Graphics
Sheetal Aute

Production Coordinator
Komal Ramchandani

Cover Work
Komal Ramchandani
About the Author
About the Reviewers
Swathi also has a Master of Science degree in computer science from UMass Lowell
and a Bachelor of Engineering degree in information science from KVGCE in India.
During her studies at UMass Lowell, she worked as a teaching assistant, helping
professors in teaching classes and labs, designing projects, and grading exams.
She has worked as a software development intern with IT companies such as EMC
and SAP. At EMC, she gained experience on Apache Cassandra data modeling and
performance analysis. At SAP, she gained experience on the infrastructure/cluster
management components of the Sybase IQ product. She has also worked with
Wipro Technologies in India as a project engineer, managing application servers.
She has extensive experience with database systems such as Apache Cassandra,
Sybase IQ, Oracle, MySQL, and MS Access. Her interests include software
design and development, big data analysis, optimization of databases, and cloud
computing. Her LinkedIn profile is http://www.linkedin.com/pub/swathi-kurunji/49/578/30a/.
Swathi has previously reviewed two books, Cassandra Data Modeling and Analysis
and Mastering Apache Cassandra, both by Packt Publishing.
Jeff Lawson has been a user and fan of PostgreSQL since he noticed it in 2001. Over
the years, he has also developed and deployed applications for IBM DB2, Oracle,
MySQL, Microsoft SQL Server, Sybase, and others, but he has always preferred
PostgreSQL because of its balance of features and openness. Much of his experience
has spanned development for Internet-facing websites and projects that required
highly scalable databases with high availability or provisions for disaster recovery.
Jeff is fond of cattle, holds an FAA private pilot certificate, and owns an airplane
in Houston, Texas.
Shaun M. Thomas has been working with PostgreSQL since late 2000. He has
presented at Postgres Open conferences in 2011, 2012, and 2014 on topics such
as handling extreme throughput, high availability, server redundancy, failover
techniques, and system monitoring. With the recent publication of Packt Publishing's
PostgreSQL 9 High Availability Cookbook, he hopes to make life easier for DBAs using
PostgreSQL in enterprise environments.
Currently, Shaun serves as the database architect at Peak6, an options trading firm
with a PostgreSQL constellation of over 100 instances, one of which is over 15 TB
in size.
He wants to prove that PostgreSQL is more than ready for major installations.
Tomas Vondra has been working with PostgreSQL since 2003, and although he
had worked with various other databases—both open-source and proprietary—he
instantly fell in love with PostgreSQL and the community around it.
www.PacktPub.com
Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.PacktPub.com,
and as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at service@packtpub.com for more details.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital
book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser
Table of Contents
Preface xi
Chapter 1: Understanding the Concepts of Replication 1
The CAP theorem and physical limitations of replication 2
Understanding the CAP theorem 2
Understanding the limits of physics 4
Different types of replication 5
Synchronous versus asynchronous replication 6
Considering performance issues 7
Understanding replication and data loss 7
Single-master versus multimaster replication 8
Logical versus physical replication 9
When to use physical replication 10
When to use logical replication 10
Using sharding and data distribution 11
Understanding the purpose of sharding 11
Designing a sharded system 11
Querying different fields 12
Pros and cons of sharding 13
Choosing between sharding and redundancy 14
Increasing and decreasing the size of a cluster 15
Combining sharding and replication 17
Various sharding solutions 18
PostgreSQL-based sharding 19
Summary 19
Error scenarios 93
Network connection between the master and slave is dead 93
Rebooting the slave 94
Rebooting the master 94
Corrupted XLOG in the archive 94
Making streaming-only replication more robust 95
Using wal_keep_segments 95
Utilizing replication slots 96
Efficient cleanup and the end of recovery 97
Gaining control over the restart points 97
Tweaking the end of your recovery 98
Conflict management 98
Dealing with timelines 101
Delayed replicas 102
Handling crashes 103
Summary 103
Chapter 5: Setting Up Synchronous Replication 105
Synchronous replication setup 105
Understanding the downside to synchronous replication 106
Understanding the application_name parameter 106
Making synchronous replication work 107
Checking the replication 109
Understanding performance issues 110
Setting synchronous_commit to on 111
Setting synchronous_commit to remote_write 111
Setting synchronous_commit to off 111
Setting synchronous_commit to local 112
Changing durability settings on the fly 112
Understanding the practical implications and performance 113
Redundancy and stopping replication 115
Summary 115
Chapter 6: Monitoring Your Setup 117
Checking your archive 117
Checking archive_command 118
Monitoring the transaction log archive 118
Checking pg_stat_replication 119
Relevant fields in pg_stat_replication 120
Checking for operating system processes 121
Checking for replication slots 122
Preface
Since the first edition of PostgreSQL Replication, many new technologies have emerged
or improved. In the PostgreSQL community, countless people around the globe have
been working on important techniques and technologies to make PostgreSQL even
more useful and more powerful.
To make sure that readers can enjoy all those new features and powerful tools, I
have decided to write a second, improved edition of PostgreSQL Replication. Due
to the success of the first edition, the hope is to make this one even more useful to
administrators and developers alike around the globe.
All the important new developments have been covered, and most chapters have
been reworked to make them easier to understand, more complete, and absolutely
up to date.
I hope that all of you can enjoy this book and benefit from it.
Chapter 3, Understanding Point-in-time Recovery, is the next logical step and outlines
how the PostgreSQL transaction log will help you to utilize Point-in-time Recovery
to move your database instance back to a desired point in time.
Chapter 8, Working with PgBouncer, deals with PgBouncer, which is very often used
along with PostgreSQL replication. You will learn how to configure PgBouncer and
boost the performance of your PostgreSQL infrastructure.
Chapter 9, Working with pgpool, covers one more tool capable of handling replication
and PostgreSQL connection pooling.
Chapter 10, Configuring Slony, contains a practical guide to using Slony and shows
how you can use this tool fast and efficiently to replicate sets of tables.
Chapter 11, Using SkyTools, offers you an alternative to Slony and outlines how
you can introduce generic queues to PostgreSQL and utilize Londiste replication
to dispatch data in a large infrastructure.
Chapter 13, Scaling with PL/Proxy, describes how you can break the chains and
scale out infinitely across a large server farm.
Chapter 14, Scaling with BDR, describes the basic concepts and workings of the
BDR replication system. It shows how BDR can be configured and how it operates
as the basis for a modern PostgreSQL cluster.
Chapter 15, Working with Walbouncer, shows how the transaction log can be replicated
partially using the walbouncer tool. It dissects the PostgreSQL XLOG and makes
sure that the transaction log stream can be distributed to many nodes in the cluster.
Conventions
In this book, you will find a number of text styles that distinguish between different
kinds of information. Here are some examples of these styles and an explanation of
their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"We can include other contexts through the use of the include directive."
When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are set in bold:
checkpoint_segments = 3
checkpoint_timeout = 5min
checkpoint_completion_target = 0.5
checkpoint_warning = 30s
New terms and important words are shown in bold. Words that you see on the
screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking
the Next button moves you to the next screen."
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about
this book—what you liked or disliked. Reader feedback is important for us as it
helps us develop titles that you will really get the most out of.
If there is a topic that you have expertise in and you are interested in either writing
or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to
help you to get the most from your purchase.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books—maybe a mistake in the text or
the code—we would be grateful if you could report this to us. By doing so, you can
save other readers from frustration and help us improve subsequent versions of this
book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the Errata Submission Form
link, and entering the details of your errata. Once your errata are verified, your
submission will be accepted and the errata will be uploaded to our website or
added to any list of existing errata under the Errata section of that title.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all
media. At Packt, we take the protection of our copyright and licenses very seriously.
If you come across any illegal copies of our works in any form on the Internet,
please provide us with the location address or website name immediately so that
we can pursue a remedy.
We appreciate your help in protecting our authors and our ability to bring you
valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at
questions@packtpub.com, and we will do our best to address the problem.
Understanding the Concepts
of Replication
Replication is an important topic and, in order to get started, it is essential
to understand some basic concepts and theoretical ideas related to it. In this
chapter, you will be introduced to various concepts of replication, and learn which
kind of replication is most suitable for which kind of practical scenario. By the end
of the chapter, you will be able to judge whether a certain concept is feasible under
various circumstances or not.
The goal of this chapter is to give you a lot of insights into the theoretical concepts.
This is truly important in order to judge whether a certain customer requirement
is actually technically feasible or not. You will be guided through the fundamental
ideas and important concepts related to replication.
In this section, you will learn about the so-called CAP theorem. Understanding the
basic ideas of this theorem is essential in order to avoid chasing requirements
that cannot be turned into reality.
The CAP theorem was first described by Eric Brewer back in the year 2000. It
has quickly developed into one of the most fundamental concepts in the database
world. Especially with the rise of NoSQL database systems, Brewer's theorem
(as the CAP theorem is often called) has become an important cornerstone of
every distributed system.
• Consistency: This term indicates whether all the nodes in a cluster see
the same data at the same time or not. A read-only node has to see all
previously completed writes at any time.
• Availability: Reads and writes have to succeed all the time. In other
words, a node has to be available to users at any point in time.
• Partition tolerance: This means that the system will continue to work even
if arbitrary messages are lost on the way. A network partition event occurs
when a system is no longer accessible (think of a network connection failure).
A different way of considering partition tolerance is to think of it as message
passing. If an individual system can no longer send or receive messages
from other systems, it means that it has been effectively partitioned out of
the network. The guaranteed properties are maintained even when network
failures prevent some machines from communicating with others.
Why are these three concepts relevant to normal users? Well, the bad news is that a
replicated (or distributed) system can provide only two out of these three features at
the same time.
Keep in mind that only two out of the three promises can be fulfilled.
Consider an application collecting a log of weather data from some remote locations.
If the data is a couple of minutes late, it is really not a problem. In this case, you might
want to go for cAP (the lowercase "c" indicating that strict consistency is sacrificed).
Availability and partition tolerance might really be the most important things in this case.
Depending on the use, people have to decide what is really important and which
attributes (consistency, availability, or partition tolerance) are crucial and which
can be neglected.
Keep in mind that there is no system that can fulfill all these wishes at the same time
(neither open source nor commercial software).
We all know that there is some sort of cosmic speed limit called the speed of light.
So why care? Well, let's do a simple mental experiment. Let's assume for a second
that our database server is running at 3 GHz clock speed.
How far can light travel within one clock cycle of your CPU? If you do the math,
you will figure out that light travels around 10 cm per clock cycle (in pure vacuum).
We can safely assume that an electric signal inside a CPU will be very slow
compared to pure light in vacuum. The core idea is, "10 cm in one clock cycle?
Well, this is not much at all."
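The arithmetic behind this is simple enough to check. Here is a quick sketch using the vacuum speed of light; real signals in copper or fiber are notably slower, so these numbers are a best case:

```python
C = 299_792_458        # speed of light in a vacuum, in meters per second
CLOCK_HZ = 3e9         # the 3 GHz CPU from the example

# Distance light covers during a single clock cycle: roughly 10 cm.
per_cycle_m = C / CLOCK_HZ

# A round trip to a server just 100 m away already costs on the order of
# two thousand clock cycles, even at the physical optimum.
round_trip_s = 2 * 100 / C
cycles_lost = round_trip_s * CLOCK_HZ

print(per_cycle_m)     # about 0.1 m
print(cycles_lost)     # about 2000 cycles
```

In other words, every network hop burns time that the CPU could have spent on thousands of instructions, no matter how fat the pipe is.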
For the sake of our mental experiment, let's now consider various distances:
Considering the size of a CPU core on a die, you can assume that you can send a
signal (even if it is not traveling anywhere close to the speed of light) from one part
of the CPU to some other part quite fast. It simply won't take 1 million clock cycles
to add up two numbers that are already in your first level cache on your CPU.
But what happens if you have to send a signal from one server to some other server
and back? You can safely assume that sending a signal from server A to server B next
door takes a lot longer because the cable is simply a lot longer. In addition to that,
network switches and other network components will add some latency as well.
Let's talk about the length of the cable here, and not about its bandwidth.
Sending a message (or a transaction) from Europe to China is, of course, many times
more time-consuming than sending some data to a server next door. Again, the
important thing here is that the amount of data is not as relevant as the so-called
latency. Consider the following criteria:
The most important point you have to keep in mind here is that bandwidth is not
always the magical fix to a performance problem in a replicated environment. In
many setups, latency is at least as important as bandwidth.
What does this mean? Let's assume we have two servers and we want to replicate
data from one server (the master) to the second server (the slave). The following
diagram illustrates the concept of synchronous and asynchronous replication:
We can use a simple transaction like the one shown in the following:
BEGIN;
INSERT INTO foo VALUES ('bar');
COMMIT;
In the case of asynchronous replication, the data can be replicated after the
transaction has been committed on the master. In other words, the slave is never
ahead of the master; and in the case of writing, it is usually a little behind the
master. This delay is called lag.
Some systems will also use a quorum server to decide. So, it is not
always about just two or more servers. If a quorum is used, more
than half of the servers must agree on an action inside the cluster.
Let's assume that we are replicating data asynchronously in the following manner:
In the case of asynchronous replication, there is a window (lag) during which data
can essentially be lost. The size of this window might vary, depending on the type of
setup. Its size can be very short (maybe as short as a couple of milliseconds) or long
(minutes, hours, or days). The important fact is that data can be lost. A small lag will
only make data loss less likely, but any nonzero lag is susceptible to data
loss. If data can be lost, we are about to sacrifice the consistency part of CAP (if two
servers don't have the same data, they are out of sync).
If you want to make sure that data can never be lost, you have to switch to
synchronous replication. As you have already seen in this chapter, a synchronous
transaction is synchronous because it will be valid only if it commits to at least
two servers.
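In PostgreSQL, this behavior is governed by a couple of configuration parameters, which Chapter 5, Setting Up Synchronous Replication, discusses in detail. As a preview, a minimal, purely illustrative postgresql.conf fragment on the master might look like this (the standby name is an assumption for the example):

```
synchronous_standby_names = 'standby1'  # a commit waits for this standby
synchronous_commit = on                 # do not report success before the standby confirms
```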
"Single-master" means that writes can go to exactly one server, which distributes
the data to the slaves inside the setup. Slaves may receive only reads, and no writes.
The ability to write to any node inside the cluster sounds like an advantage, but it is
not necessarily one. The reason for this is that multimaster replication adds a lot of
complexity to the system. In the case of only one master, it is totally clear which data
is correct and in which direction data will flow, and there are rarely conflicts during
replication. Multimaster replication is quite different, as writes can go to many nodes
at the same time, and the cluster has to be perfectly aware of conflicts and handle
them gracefully. An alternative would be to use locks to solve the problem, but this
approach will also have its own challenges.
• Physical replication means that the system will move data as is to the
remote box. So, if something is inserted, the remote box will get data
in binary format, not via SQL.
• Logical replication means that a change, which is equivalent to data
coming in, is replicated.
INSERT 0 1
We see two transactions going on here. The first transaction creates a table. Once
this is done, the second transaction adds a simple date to the table and commits.
In the case of logical replication, the change will be sent to some sort of queue in
logical form, so the system does not send plain SQL, but maybe something such
as this:
test=# INSERT INTO t_test VALUES ('2013-02-08');
INSERT 0 1
Note that the function call has been replaced with the real value. It would be a total
disaster if the slave were to calculate now() once again, because the date on the
remote box might be a totally different one.
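The danger of re-evaluating a volatile function on the slave can be sketched in a few lines of Python. The two "clocks" are faked and exaggerated here purely to make the point deterministic:

```python
# Two nodes whose clocks disagree (faked; real clock skew is usually smaller).
master_now = lambda: "2013-02-08 10:00:00"
slave_now = lambda: "2013-02-08 09:59:30"

# Statement-based shipping: the slave re-evaluates now() and stores a
# different value than the master did.
row_on_master = master_now()
row_via_statement = slave_now()

# Logical replication: the already-evaluated value is shipped, so both
# nodes end up with identical data.
row_via_logical = row_on_master

print(row_on_master == row_via_logical)    # True
print(row_on_master == row_via_statement)  # False
```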
Physical replication will work in a totally different way; instead of sending some SQL
(or something else) over, which is logically equivalent to the changes made, the system
will send binary changes made by PostgreSQL internally.
Here are some of the binary changes our two transactions might have triggered
(though this is by no means a complete list):
1. Added an 8 K block to pg_class and put a new record there (to indicate
that the table is present).
2. Added rows to pg_attribute to store the column names.
3. Performed various changes inside the indexes on those tables.
4. Recorded the commit status, and so on.
The goal of physical replication is to create a copy of your system that is (largely)
identical on the physical level. This means that the same data will be in the same
place inside your tables on all boxes. In the case of logical replication, the content
should be identical, but it makes no difference whether it is in the same place or not.
In many setups, physical replication is the standard method that exposes the end
user to the lowest complexity possible. It is ideal for scaling out the data.
Logical replication allows decoupling of the way data is stored from the way it is
transported and replicated. Using a neutral protocol, which is not bound to any
specific version of PostgreSQL, it is easy to jump from one version to the next.
Since PostgreSQL 9.4, there is something called Logical Decoding. It allows users
to extract internal changes sent to the XLOG as SQL again. Logical decoding will
be needed for a couple of replication techniques outlined in this book.
Clearly, at some point, you cannot buy servers that are big enough to handle an
infinite load anymore. It is simply impossible to run a Facebook- or Google-like
application on a single server. At some point, you have to come up with a scalability
strategy that serves your needs. This is when sharding comes into play.
The idea of sharding is simple: What if you could split data in a way that it can
reside on different nodes?
As you can see in our diagram, we have nicely distributed the data. Once this is
done, we can send a query to the system, as follows:
SELECT * FROM t_user WHERE id = 4;
The client can easily figure out where to find the data by inspecting the filter in
our query. In our example, the query will be sent to the first node because we are
dealing with an even number.
As we have distributed the data based on a key (in this case, the user ID), we can
search for any person easily if we know the key. In large systems, referring to users
through a key is a common practice, and therefore, this approach is suitable. By using
this simple approach, we have also easily doubled the number of machines in
our setup.
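The client-side routing logic described above fits in a couple of lines. The node names below are made up for the example:

```python
NODES = ["node_even", "node_odd"]  # hypothetical two-shard setup

def node_for_user(user_id: int) -> str:
    # Even user IDs live on the first node, odd ones on the second,
    # exactly as in the id = 4 example above.
    return NODES[user_id % 2]

print(node_for_user(4))  # node_even
```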
When you are trying to break up data and store it on different hosts, always make
sure that you are using a proper partitioning function. It can be very beneficial to
split data in such a way that each host has more or less the same amount of data.
Splitting users alphabetically might not be a good idea. The reason for that is that
not all the letters are equally likely. We cannot simply assume that the letters from
A to M occur as often as the letters from N to Z. This can be a major issue if you
want to distribute a dataset to a thousand servers instead of just a handful of
machines. As stated before, it is essential to have a proper partitioning function
that produces evenly distributed results.
In many cases, a hash function will provide you with nicely and evenly
distributed data. This can especially be useful when working with
character fields (such as names, e-mail addresses, and so on).
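As a sketch of this tip: hashing a string key before applying the modulo spreads data far more evenly than, say, splitting on the first letter. Note that Python's built-in hash() is randomized per process, so a fixed digest is used instead:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # A stable hash turns an arbitrary string (a name, an e-mail address,
    # and so on) into an integer that distributes evenly across shards.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Deterministic: the same key always lands on the same shard.
print(shard_for("alice@example.com", 4) == shard_for("alice@example.com", 4))  # True
```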
Remember that we have distributed data using the ID. In our query, however, we
are searching for the name. The application will have no idea which partition to use
because there is no rule telling us what is where.
As a logical consequence, the application has to ask every partition for the name
parameter. This might be acceptable if looking for the name was a real corner case;
however, we cannot rely on this fact. Having to ask many servers instead of one
is clearly a serious deoptimization, and therefore, not acceptable.
We have two ways to approach the problem: coming up with a cleverer partitioning
function, or storing the data redundantly.
Coming up with a cleverer partitioning function would surely be the best option,
but it is rarely possible if you want to query different fields.
This leaves us with the second option, which is storing data redundantly. Storing a
set of data twice, or even more often, is not too uncommon, and it's actually a good
way to approach the problem. The following diagram shows how this can be done:
As you can see, we have two clusters in this scenario. When a query comes in, the
system has to decide which data can be found on which node. For cases where the
name is queried, we have (for the sake of simplicity) simply split the data into half
alphabetically. In the first cluster, however, our data is still split by user ID.
Each practical use case is different, and there is no replacement for common sense
and deep thinking.
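A routing layer for such a redundant setup might look roughly like the following sketch. The cluster names and the alphabetical split point are assumptions made for the example:

```python
def route(field: str, value):
    """Pick (cluster, partition) for a point query in the two-cluster setup."""
    if field == "id":
        # Cluster 1 is still split by user ID.
        return ("cluster1", value % 2)
    if field == "name":
        # Cluster 2 holds a redundant copy, split alphabetically in half.
        return ("cluster2", 0 if value[:1].lower() <= "m" else 1)
    # Any other filter cannot be mapped to a single partition.
    raise ValueError(f"no partitioning rule for {field}; all nodes must be asked")

print(route("id", 4))        # ('cluster1', 0)
print(route("name", "Zoe"))  # ('cluster2', 1)
```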
Light and shade tend to go together, and therefore sharding also has its downsides:
• Adding servers on the fly can be far from simple (depending on the type
of partitioning function)
• Your flexibility might be seriously reduced
• Not all types of queries will be as efficient as they would be on a single server
• There is an increase in overall complexity of the setup (such as failover,
resyncing, maintenance and so on)
• Backups need more planning
• You might face redundancy and additional storage requirements
• Application developers need to be aware of sharding to make sure that
efficient queries are written
In Chapter 13, Scaling with PL/Proxy, we will discuss how you can efficiently use
sharding along with PostgreSQL, and how to set up PL/Proxy for maximum
performance and scalability.
We might be able to partition the t_user table nicely and split it in such a way that it
can reside on a reasonable number of servers. But what about the t_language table?
Our system might support as many as 10 languages.
It can make perfect sense to shard and distribute hundreds of millions of users, but
splitting up 10 languages? This is clearly useless. In addition to all this, we might
need our language table on all the nodes so that we can perform joins.
One solution to the problem is simple: you need a full copy of the language table on
all the nodes. This will not cause a storage-consumption-related problem because
the table is so small. Of course, there are many different ways to attack the problem.
Make sure that only large tables are sharded. In the case of small
tables, full replicas of the tables might just make much more sense.
A commonly made mistake is that people tend to increase the size of their setup in
unnecessarily small steps. Somebody might want to move from five to maybe six or
seven machines. This can be tricky. Let's assume for a second that we have split data
using user id % 5 as the partitioning function. What if we wanted to move to user
id % 6? This is not so easy; the problem is that we have to rebalance the data inside
our cluster to reflect the new rules.
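A little arithmetic shows why moving from user id % 5 to user id % 6 is so painful, while doubling the number of partitions is comparatively harmless:

```python
# Going from % 5 to % 6: within one full cycle of 30 consecutive IDs,
# only 5 keep their old partition number; all the others must be moved.
moved = sum(1 for k in range(30) if k % 5 != k % 6)
print(moved)  # 25, i.e. about 83 percent of all keys

# Doubling to % 10 instead: every new partition r contains only keys
# from the old partition r % 5, so each old partition simply splits in
# two and no data has to cross over between old partitions.
clean_split = all((k % 10) % 5 == k % 5 for k in range(100_000))
print(clean_split)  # True
```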
Remember that we have introduced sharding (that is, partitioning) because we have
so much data and so much load that one server cannot handle the requests anymore.
Now, if we come up with a strategy that requires rebalancing of data, we are already
on the wrong track. You definitely don't want to rebalance 20 TB of data just to add
two or three servers to your existing system.
Practically, it is a lot easier to simply double the number of partitions. Doubling your
partitions does not require rebalancing of data because you can simply follow the
strategy outlined here:
Instead of just doubling your cluster (which is fine for most cases), you can also
give more thought to writing a more sophisticated partitioning function that leaves
the old data in place but handles the more recent data more intelligently. Having
time-dependent partitioning functions might cause issues of its own, but it might
be worth investigating this path.
If you expect your cluster to grow, we recommend starting with more partitions
than those initially necessary, and packing more than just one partition on a single
server. Later on, it will be easy to move single partitions to additional hardware
joining the cluster setup. Some cloud services are able to do that, but those aspects
are not covered in this book.
To shrink your cluster again, you can simply apply the opposite strategy and move
more than just one partition to a single server. This leaves the door for a future
increase of servers wide open, and can be done fairly easily.
Let's assume we have three servers (A, B, and C). What happens in consistent hashing
is that an array of, say, 1,000 slots can be created. Then each server name is hashed a
couple of times and entered in the array. The result might look like this:
43 B, 153 A, 190 C, 340 A, 450 C, 650 B, 890 A, 930 C, 980 B
Each value shows up multiple times. In the case of insertion, we take the input key
and calculate a value. Let's assume hash (some input value) equals 58. The result will
go to server A. Why? There is no entry for 58, so the system moves forward in the
list, and the first valid entry is 153, which points to A. If the hash value returned 900,
the data would end up on C. Again, there is no entry for 900, so the system has to
move forward in the list until something is found.
If a new server is added, new values for the server will be added to the array (D
might be on 32, 560, 940, or so). The system has to rebalance some data, but of
course, not all of the data. It is a major advantage over a simple hashing mechanism,
such as a simple key % server_number function. Reducing the amount of data to
be resharded is highly important.
The main advantage of consistent hashing is that it scales a lot better than
simple approaches.
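The lookup rule from the example can be implemented directly. The ring below uses exactly the slot values quoted above:

```python
import bisect

# The example ring: each server was hashed several times into 1,000 slots.
ring = [(43, "B"), (153, "A"), (190, "C"), (340, "A"), (450, "C"),
        (650, "B"), (890, "A"), (930, "C"), (980, "B")]
positions = [pos for pos, _ in ring]

def lookup(hash_value: int) -> str:
    # Move forward to the first slot at or after the hash value;
    # past the last slot, wrap around to the beginning of the ring.
    i = bisect.bisect_left(positions, hash_value)
    return ring[i % len(ring)][1]

print(lookup(58))   # A (first entry at or after 58 is 153 -> A)
print(lookup(900))  # C (first entry at or after 900 is 930 -> C)
```

Adding a server means inserting its slots into the sorted list; only the keys that fall just before those new slots move, which is the whole point of the scheme.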
The more servers you have in your setup, the more likely it will be that one of
those servers will be down or not available for some other reason.
One server is never enough to provide us with High Availability. Every system
needs a backup system that can take over in the event of a serious emergency. By
just splitting a set of data, we definitely do not improve availability. This is because
we have more servers, which can fail at this point. To fix this problem, we can add
replicas to each of our partitions (shards), just as is shown in the following diagram:
Keep in mind that you can choose from the entire arsenal of features and options
discussed in this book (for example, synchronous and asynchronous replication).
All the strategies outlined in this book can be combined flexibly. A single technique
is usually not enough, so feel free to combine various technologies in different ways
to achieve your goals.
When implementing sharding, you can basically choose between two strategies.
In the following chapters, we will discuss both options briefly. This little overview
is not meant to be a comprehensive guide, but rather an overview to get you started
with sharding.
PostgreSQL-based sharding
PostgreSQL cannot shard data out of the box, but it has all the interfaces and means
required to allow sharding through add-ons. One of these add-ons, which is widely
used, is called PL/Proxy. It has been around for many years, and offers superior
transparency as well as good scalability. It was originally developed by Skype to
scale up their infrastructure.
The idea behind PL/Proxy is basically to use a local virtual table to hide an array
of servers making up the table.
Summary
In this chapter, you learned about basic replication-related concepts as well as
the physical limitations. We dealt with the theoretical concepts, which are the
groundwork for what is still to come in this book.
In the next chapter, you will be guided through the PostgreSQL transaction log,
and we will outline all important aspects of this vital component. You will learn
what the transaction log is good for and how it can be applied.
Understanding the
PostgreSQL Transaction Log
In the previous chapter, we dealt with various replication concepts. That was meant
to be more of a theoretical overview to sharpen your senses for what is to come,
and was supposed to introduce the topic in general.
In this chapter, we will move closer to practical solutions, and you will learn how
PostgreSQL works internally and what it means for replication. We will see what
the so-called transaction log (XLOG) does and how it operates. The XLOG is the
very backbone of the PostgreSQL onboard replication machinery. It is essential to
understand how this part works in order to fully understand replication. You will
learn about several topics in this chapter, including replication slots and
logical decoding.
Once you have completed reading this chapter, you will be ready to understand
the next chapter, which will teach you how to safely replicate your first database.
Chapter 2
You will see a range of files and directories, which are needed to run a database
instance. Let's take a look at them in detail.
[hs@paula base]$ ls -l
total 24
drwx------ 2 hs hs 8192 Dec 9 10:58 1
drwx------ 2 hs hs 8192 Dec 9 10:58 12180
drwx------ 2 hs hs 8192 Feb 11 14:29 12185
drwx------ 2 hs hs 4096 Feb 11 18:14 16384
We can easily link these directories to the databases in our system. It is worth
noting that PostgreSQL uses the object ID of the database here. This has many
advantages over using the name, because the object ID never changes and offers
a good way to abstract all sorts of problems, such as issues with different character
sets on the server:
test=# SELECT oid, datname FROM pg_database;
  oid  |  datname
-------+-----------
     1 | template1
 12180 | template0
 12185 | postgres
 16384 | test
(4 rows)
Now we can see how data is stored inside those database-specific directories.
In PostgreSQL, each table is related to (at least) one data file. Let's create a table
and see what happens:
test=# CREATE TABLE t_test (id int4);
CREATE TABLE
We can check the system table now to retrieve the so-called relfilenode variable,
which represents the name of the storage file on the disk:
test=# SELECT relfilenode, relname
FROM pg_class
WHERE relname = 't_test';
relfilenode | relname
-------------+---------
16385 | t_test
(1 row)
As soon as the table is created, PostgreSQL will create an empty file on the disk:
[hs@paula base]$ ls -l 16384/16385*
-rw------- 1 hs staff 0 Feb 12 12:06 16384/16385
So, if the file called 16385 grows beyond 1 GB, there will be a file called 16385.1;
once this has been filled, you will see a file named 16385.2; and so on. In this way,
a table in PostgreSQL can be scaled up reliably and safely without having to worry
too much about the underlying filesystem limitations of some rare operating systems
or embedded devices. (Most modern filesystems can handle a large number of files
efficiently; however, not all filesystems are created equal.)
Relation forks
Other than the data files discussed in the previous paragraph, PostgreSQL will
create additional files using the same relfilenode number. So far, those files
have been used to store information about free space inside a table (the Free Space
Map), the so-called Visibility Map file, and so on. In the future, more types of
relation forks might be added.
If the commit log of a database instance is broken for some reason (maybe because
of filesystem corruption), you can expect some funny hours ahead.
The important thing here is that if a database instance is fully replicated, we simply
cannot rely on the fact that all the servers in the cluster use the same disk layout
and the same storage hardware. There can easily be scenarios in which a master
needs a lot more I/O power than a slave, which might just be around to function as
backup or standby. To allow users to handle different disk layouts, PostgreSQL will
place symlinks in the pg_tblspc directory. The database will blindly follow those
symlinks to find the tablespaces, regardless of where they are.
This gives end users enormous power and flexibility. Controlling storage is essential
to both replication as well as performance in general. Keep in mind that those
symlinks can only be changed post transaction (users can do that if a slave server in
a replicated setup does not use the same filesystem layout). This should be carefully
thought over.
What you see is a bunch of files that are always exactly 16 MB in size (the default
setting). The filename of an XLOG file is generally 24 bytes long. The numbering is
always hexadecimal. So, the system will count "… 9, A, B, C, D, E, F, 10" and so on.
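In PostgreSQL 9.x, those 24 hexadecimal characters break down into three 8-digit parts: the timeline, the logical XLOG file, and the segment number within that file. A small sketch composing such a name:

```python
def xlog_file_name(timeline, log_id, seg):
    # 24 hexadecimal characters: timeline, logical XLOG file, segment
    return "%08X%08X%08X" % (timeline, log_id, seg)

print(xlog_file_name(1, 0, 9))     # ends in ...00000009
print(xlog_file_name(1, 0, 0x10))  # after F, counting continues with 10
```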
One important thing to mention is that the size of the pg_xlog directory will not
vary wildly over time, and it is totally independent of the type of transactions
you are running on your system. The size of the XLOG is determined by the
postgresql.conf parameters, which will be discussed later in this chapter. In short,
no matter whether you are running small or large transactions, the size of the XLOG
will be the same. You can easily run a transaction as big as 1 TB with just a handful
of XLOG files. This might not be too efficient performance wise, but it is technically
perfectly feasible.
If you happen to use prebuilt binaries, you might not find postgresql.conf
directly inside your data directory. It is more likely to be located in
some subdirectory of /etc/ (on Linux/Unix) or in your place of choice
in Windows. The precise location is highly dependent on the type of
operating system you are using. The typical location of data directories is
/var/lib/pgsql/data, but postgresql.conf is often located under
/etc/postgresql/9.X/main/postgresql.conf (as in Ubuntu and
similar systems), or under /etc directly.
Note that in this section, which is about writing a row of data, we have simplified
the process a little to make sure that we can stress the main point and the ideas
behind the PostgreSQL XLOG.
As one might imagine, the goal of an INSERT operation is to somehow add a row to
an existing table. We have seen in the previous section—the section about the disk
layout of PostgreSQL—that each table will be associated with a file on the disk.
Let's perform a mental experiment and assume that the table we are dealing with
here is 10 TB large. PostgreSQL will see the INSERT operation and look for some
spare place inside this table (either using an existing block or adding a new one).
For the purpose of this example, we have simply put the data into the second
block of the table.
Everything will be fine as long as the server actually survives the transaction. What
happens if somebody pulls the plug after just writing abc instead of the entire data?
When the server comes back up after the reboot, we will find ourselves in a situation
where we have a block with an incomplete record, and to make it even funnier, we
might not even have the slightest idea where this block containing the broken record
might be.
To fix the problem that we have just discussed, PostgreSQL uses the so-called WAL
or simply XLOG. Using WAL means that a log is written ahead of data. So, before we
actually write data to the table, we make log entries in a sequential order, indicating
what we are planning to do to our underlying table. The following diagram shows
how things work in WAL:
As we can see from this diagram, once we have written data to the log in (1), we can
go ahead and mark the transaction as done in (2). After that, data can be written to
the table, as marked with (3).
We have left out the memory part of the equation; this will be
discussed later in this section.
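The three steps from the diagram can be modeled with a toy example. This is a drastically simplified sketch, not PostgreSQL's actual implementation; it only shows why writing the log first makes a crash survivable:

```python
# Toy model of write-ahead logging: the log is written first, the table
# file later. If the server "crashes" before the table write, replaying
# the (durable) log on startup restores the committed change.
class ToyDatabase:
    def __init__(self):
        self.wal = []     # the durable log (survives the crash)
        self.table = {}   # the data file (its write may be lost)

    def insert(self, key, value):
        self.wal.append(('insert', key, value))  # (1) write to the log
        self.wal.append(('commit',))             # (2) mark transaction done
        # (3) would write to the table here -- the crash happens first!

    def recover(self):
        pending = []
        for record in self.wal:
            if record[0] == 'insert':
                pending.append(record[1:])
            elif record[0] == 'commit':   # only replay committed work
                for key, value in pending:
                    self.table[key] = value
                pending = []

db = ToyDatabase()
db.insert('id', 1)
assert db.table == {}   # crash before step (3): data file is incomplete
db.recover()            # replay the log, as PostgreSQL does on startup
print(db.table)         # -> {'id': 1}
```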
As soon as PostgreSQL starts up, it can go through the XLOG and replay everything
necessary to make sure that PostgreSQL is in a consistent state. So, if we don't make
it through WAL-writing, it means that something nasty has happened and we cannot
expect a write to be successful.
A WAL entry will always know whether it is complete or not. Every WAL entry
has a checksum inside, and therefore PostgreSQL can instantly detect problems if
somebody tries to replay a broken WAL. This is especially important during a crash,
when we might not be able to rely on the very latest data written to the disk. The
WAL will automatically sort those problems during crash recovery.
Well, in this case, we will know during replay what is missing. Again, we go to
the WAL and replay what is needed to make sure that all of the data is safely in
our data table.
While it might be hard to find data in a table after a crash, we can always rely on
the fact that we can find data in the WAL. The WAL is sequential, and if we simply
keep a track of how much data has been written, we can always continue from there.
The XLOG will lead us directly to the data in the table, and it always knows where
a change has been made or should have been made. PostgreSQL does not have to
search for data in the WAL; it just replays it from the proper point onward. Keep in
mind that replaying XLOG is very efficient and can be a lot faster than the original
write to the master.
Read consistency
Now that we have seen how a simple write is performed, we will see what
impact writes have on reads. The next diagram shows the basic architecture
of the PostgreSQL database system:
For the sake of simplicity, we can see a database instance as an entity consisting of
three major components:
In the previous sections, we have already discussed data files. You have also read
some basic information about the transaction log itself. Now we have to extend our
model and bring another component into the scenery—the memory component of
the game, which is the shared buffer.
However, performance is not the only issue we should be focused on when it comes
to the shared buffer. Let's assume that we want to issue a query. For the sake of
simplicity, we also assume that we need just one block to process this read request.
What happens if we perform a simple read? Maybe we are looking for something
simple, such as a phone number or a username, given a certain key. The following
list shows in a highly simplified way what PostgreSQL will do under the assumption
that the instance has been freshly restarted:
1. PostgreSQL will look up the desired block in the cache (as stated before,
this is the shared buffer). It will not find the block in the cache of a freshly
started instance.
2. PostgreSQL will ask the operating system for the block.
3. Once the block has been loaded from the OS, PostgreSQL will put it into
the first queue of the cache.
4. The query will be served successfully.
Let's assume that the same block will be used again, this time by a second query.
In this case, things will work as follows:
1. PostgreSQL will look up the desired block and come across a cache hit.
2. Then PostgreSQL will figure out that a cached block has been reused, and
move it from a lower level of cache (Q1) to a higher level of cache (Q2).
Blocks that are in the second queue will stay in the cache longer, because they
have proven to be more important than those that are only at the Q1 level.
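The two-queue behavior described above can be sketched like this. It is a toy model only; PostgreSQL's real buffer manager is considerably more sophisticated (eviction, pinning, and so on are left out):

```python
# A toy version of the two-queue idea: a block read for the first time
# lands in Q1; if it is requested again, it is promoted to Q2, where it
# survives in the cache longer.
class TwoQueueCache:
    def __init__(self):
        self.q1 = set()   # blocks seen once
        self.q2 = set()   # blocks that have proven to be important

    def get_block(self, block):
        if block in self.q2:
            return 'hit (Q2)'
        if block in self.q1:        # reused: promote to the higher level
            self.q1.discard(block)
            self.q2.add(block)
            return 'hit (promoted to Q2)'
        self.q1.add(block)          # first access: load from the OS into Q1
        return 'miss (loaded into Q1)'

cache = TwoQueueCache()
print(cache.get_block(42))   # miss (loaded into Q1)
print(cache.get_block(42))   # hit (promoted to Q2)
print(cache.get_block(42))   # hit (Q2)
```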
How large should the shared buffer be? Under Linux, a value of up to 8
GB is usually recommended. On Windows, values below 1 GB have been
proven to be useful (as of PostgreSQL 9.2). From PostgreSQL 9.3 onwards,
higher values might be useful and feasible under Windows. Insanely
large shared buffers on Linux can actually be a deoptimization. Of course,
this is only a rule of thumb; special setups might need different settings.
Also keep in mind that some work goes on in PostgreSQL constantly and
the best practices might vary over time.
What is the point of this example? Well! As you might have noticed, we have never
talked about actually writing to the underlying table. We talked about writing to the
cache, to the XLOG, and so on, but never about the real data file.
In this example, whether the row we have written is in the table or not
is totally irrelevant. The reason is simple; if we need a block that has just
been modified, we will never make it to the underlying table anyway.
It is important to understand that data is usually not sent to a data file directly
after or during a write operation. It makes perfect sense to write data a lot later to
increase efficiency. The reason this is important is that it has subtle implications on
replication. A data file itself is worthless because it is neither necessarily complete
nor correct. To run a PostgreSQL instance, you will always need data files along with
the transaction log. Otherwise, there is no way to survive a crash.
From a consistency point of view, the shared buffer is here to complete the view a
user has of the data. If something is not in the table, logically, it has to be in memory.
In the event of a crash, the memory will be lost, and so the XLOG will be consulted
and replayed to turn data files into a consistent data store again. Under all
circumstances, data files are only half of the story.
This triplet is a unique identifier for any data-carrying object in the database system.
Depending on the type of operation, various types of records are used (commit
records, B-tree changes, heap changes, and so on).
In general, the XLOG is a stream of records lined up one after the other. Each record
is identified by the location in the stream. As already mentioned in this chapter, a
typical XLOG file is 16 MB in size (unless changed at compile time). Inside those 16
MB segments, data is organized in 8 K blocks. Each XLOG page starts with a simple
header.
In addition to this, each segment (16 MB file) has a header consisting of various
fields as well:
• System identifier
• Segment size and block size
The segment and block size are mostly available to check the correctness of the file.
XLOG addresses are highly critical and must not be changed, otherwise the entire
system breaks down.
The data structure outlined in this chapter is as of PostgreSQL 9.4. It is quite likely
that changes will happen in PostgreSQL 9.5 and beyond.
The idea of using this set of changes to replicate data is not farfetched. In fact, it is a
logical step in the development of every relational (or maybe even a nonrelational)
database system. For the rest of this book, you will see in many ways how the
PostgreSQL transaction log can be used, fetched, stored, replicated, and analyzed.
In most replicated systems, the PostgreSQL transaction log is the backbone of the
entire architecture (for synchronous as well as for asynchronous replication).
So far, we have mostly talked about corruption. It is definitely not good to lose data
files because of corrupted entries in them, but corruption is not the only issue you
have to be concerned about. Two other important topics are:
• Performance
• Data loss
While these might be obvious choices for important topics, we have the feeling
that they are not well understood and honored. Therefore, they are taken
into consideration here.
The awareness of potential data loss, or even a concept to handle it, seems to be new
to many people. Let's try to put it this way: what good is higher speed if data is lost
even faster? The point of this is not that performance is not important; performance
is highly important. However, we simply want to point out that performance is not
the only component in the bigger picture.
In our example, we have used four layers. In many enterprise systems, there can
be even more layers. Just imagine a virtual machine with storage mounted over
the network, such as SAN, NAS, NFS, ATA over Ethernet, iSCSI, and so on.
Many abstraction layers will pass data around, and each of them will try to do
its share of optimization.
The following code snippet illustrates the basic problem we are facing:
test=# \d t_test
Table "public.t_test"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
test=# BEGIN;
BEGIN
test=# INSERT INTO t_test VALUES (1);
INSERT 0 1
test=# COMMIT;
COMMIT
Just as in the previous chapter, we are using a table with only one column. The
goal is to run a transaction inserting a single row.
If a crash happens shortly after BEGIN, no data will be in danger because nothing
has happened yet. Even if a crash happens shortly after the INSERT statement but
before COMMIT, nothing can happen. The user has not issued a COMMIT command
yet, so the transaction is known to be running and thus unfinished. If a crash occurs,
the application will notice that things were unsuccessful and (hopefully) react
accordingly. Also keep in mind that every transaction that is not committed will
eventually end up as ROLLBACK.
However, the situation is quite different if the user has issued a COMMIT statement
and it has returned successfully. Whatever happens, the user will expect the
committed data to be available.
PostgreSQL does not have to force data to the data files at this point
because we can always repair broken data files from the XLOG. If data
is stored in the XLOG safely, the transaction can be considered safe.
The system call necessary to force data to the disk is called fsync(). The following
listing has been copied from the BSD manual page. In our opinion, it is one of the
best manual pages ever written dealing with the topic:
FSYNC(2) BSD System Calls Manual FSYNC(2)
NAME
fsync -- synchronize a file's in-core state with
that on disk
SYNOPSIS
#include <unistd.h>
int
fsync(int fildes);
DESCRIPTION
Fsync() causes all modified data and attributes of
fildes to be moved to a permanent storage device.
This normally results in all in-core modified
copies of buffers for the associated file to be
written to a disk.
It essentially says that the kernel tries to make its image of the file in the memory
consistent with the image of the file on the disk. It does so by forcing all changes to
the storage device. It is also clearly stated that we are not talking about a theoretical
scenario here; flushing to disk is a highly important issue.
Without a disk flush on COMMIT, you just cannot be sure that your data is safe, and
this means that you can actually lose data in the event of some serious trouble.
Speed and consistency are both essentially important, yet they can actually work
against each other. Flushing changes to the disk is especially expensive because real
hardware is involved. The overhead we have is not some five percent but a lot more.
With the introduction of SSDs, the overhead has gone down dramatically but it is
still substantial.
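A minimal Python sketch of the pattern described by the manual page (illustrative only; PostgreSQL performs the equivalent in C on its XLOG files):

```python
import os
import tempfile

def durable_write(path, data):
    """Write data and force it to the storage device before returning,
    the way a database flushes its log on COMMIT."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)   # do not report success until the disk has the data
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), 'commit_record')
durable_write(path, b'COMMIT 1009')
with open(path, 'rb') as f:
    recovered = f.read()
print(recovered)   # b'COMMIT 1009'
```

Without the `os.fsync()` call, the data might still sit in the kernel's page cache when the power goes out.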
The fsync parameter will control data loss, if it is used at all. In the default
configuration, PostgreSQL will always flush a commit out to the disk. If fsync is off,
however, there is no guarantee at all that a COMMIT statement will survive a crash.
Data can be lost and there might even be data corruption. To protect all of your data,
it is necessary to keep fsync at on. If you can afford to lose some or all of your data,
you can relax flushing standards a little.
• on: PostgreSQL will wait until XLOG has been fully and successfully written.
If you are storing credit card data, you would want to make sure that a
financial transaction is not lost. In this case, flushing to the disk is essential.
• off: There will be a time difference between reporting success to the client
and safely writing to the disk. In a setting like this, there can be corruption.
Let's assume a database that stores information about who is currently online
on a website. Suppose your system crashes and comes back up 20 minutes
later. Do you really care about your data? After 20 minutes, everybody has
to log in again anyway, and it is not worth sacrificing performance to protect
data that will be outdated in a couple of minutes anyway.
• local: In the case of a replicated database instance, we will wait only for the
local instance to flush to the disk. The advantage here is that you have a high
level of protection because you flush to one disk. However, we can safely
assume that not both servers crash at the same time, so we can relax the
standards on the slave a little.
• remote_write: PostgreSQL will wait until a synchronous standby server
reports success for a given transaction.
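In postgresql.conf, these levels correspond to the synchronous_commit parameter; a sketch with illustrative values:

```
# postgresql.conf (values are illustrative)
fsync = on                  # never turn this off if your data matters
synchronous_commit = on     # on | off | local | remote_write
```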
Keep in mind that configuring a database is not just about speed. Consistency is at
least as important as speed, and therefore, you should think carefully whether you
want to trade speed for potential data loss.
To solve this problem, the XLOG has to be deleted at some point. This process is
called checkpointing.
The main question arising from this issue is, "When can the XLOG be truncated up to
a certain point?" The answer is, "When PostgreSQL has put everything that is already
in the XLOG into the storage files." If all the changes made to the XLOG
are also made to the data files, the XLOG can be truncated.
Keep in mind that simply writing the data is worthless. We also have
to flush the data to the data tables.
In a way, the XLOG can be seen as the repairman of the data files if something
undesirable happens. If everything is fully repaired, the repair instructions can
be removed safely; this is exactly what happens during a checkpoint.
Remember that a segment is usually 16 MB, so three segments means that we will
perform a checkpoint every 48 MB. On modern hardware, 16 MB is far from enough.
In a typical production system, a checkpoint interval of 256 MB (measured in
segments) or even higher is perfectly feasible.
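A configuration along these lines might look as follows (pre-9.5 style settings; the values are illustrative):

```
# postgresql.conf (values are illustrative)
checkpoint_segments = 16    # 16 x 16 MB = 256 MB between checkpoints
checkpoint_timeout  = 5min  # a checkpoint is also triggered by time
```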
In PostgreSQL, you will figure out that there are a constant number of
transaction log files around. Unlike other database systems, the number
of XLOG files has nothing to do with the maximum size of a transaction;
a transaction can easily be much larger than the distance between two
checkpoints.
So if the data files don't have to be consistent anyway, why not vary the point in
time at which the data is written? This is exactly what we can do with the
checkpoint_completion_target parameter. The idea is to have a setting that specifies the
target of checkpoint completion as a fraction of the total time between two checkpoints.
Let's now discuss three scenarios to illustrate the purpose of the checkpoint_
completion_target parameter.
Given the type of data we are dealing with, we can assume that we will have a
workload that is dictated by UPDATE statements.
What will happen now? PostgreSQL has to update the same data over and over
again. Given the fact that the DJIA consists of only 30 different stocks, the amount
of data is very limited and our table will be really small. In addition to this, the price
might be updated every second or even more often.
Internally, the situation is like this: when the first UPDATE command comes along,
PostgreSQL will grab a block, put it into the memory, and modify it. Every
subsequent UPDATE command will most likely change the same block. Logically,
all writes have to go to the transaction log, but what happens with the cached
blocks in the shared buffer?
The general rule is as follows: if there are many UPDATE commands (and as a result,
changes made to the same block), it is wise to keep the blocks in memory as long
as possible. This will greatly increase the odds of avoiding I/O by writing multiple
changes in one go.
If you want to increase the odds of having many changes in one disk
I/O, consider decreasing checkpoint_completion_target. The
blocks will stay in the memory longer, and therefore many changes
might go into the same block before a write occurs.
During a bulk load, we would want to use all of the I/O capacity we have all the
time. To make sure that PostgreSQL writes data instantly, we have to increase
checkpoint_completion_target to a value close to 1.
In this scenario, we want to assume an application storing the so-called Call Detail
Records (CDRs) for a phone company. You can imagine that a lot of writing will
happen and people will be placing phone calls all day long. Of course, there will be
people placing a phone call that is instantly followed by the next call, but we will
also witness a great number of people placing just one call a week or so.
Technically, this means that there is a good chance that a block in the shared memory
that has recently been changed will face a second or a third change soon, but we will
also have a great deal of changes made to blocks that will never be visited again.
How will we handle this? Well, it is a good idea to write data late so that as many
changes as possible will go to pages that have been modified before. But what will
happen during a checkpoint? If changes (in this case, dirty pages) have been held
back for too long, the checkpoint itself will be intense, and many blocks will have to
be written within a fairly short period of time. This can lead to an I/O spike. During
an I/O spike, you will see that your I/O system is busy. It might show poor response
times, and those poor response times can be felt by your end user.
Let's put it in this way: assume you have used internet banking successfully for quite
a while. You are happy. Now some guy at your bank has found a tweak that makes
the database behind the internet banking 50 percent faster, but this gain comes with
a downside: for two hours a day, the system will not be reachable. Clearly, from a
performance point of view, the overall throughput will be better.
But are you, the customer, going to be happy? Clearly, you would not be. This is
a typical use case where optimizing for maximum throughput does no good. If
you can meet your performance requirements, it might be wiser to have evenly
distributed response times at the cost of a small performance penalty. In our banking
example, this would mean that your system is up 24x7 instead of just 22 hours a day.
The same concept applies to the phone application we have outlined. We are not able
to write all changes during the checkpoint anymore, because this might cause latency
issues during a checkpoint. It is also no good to make a change to the data files more
or less instantly (which means a high checkpoint_completion_target value),
because we would be writing too much, too often.
Conclusion
The conclusion that should be drawn from these three examples is that no
configuration fits all purposes. You really have to think about the type of data
you are dealing with in order to come up with a good and feasible configuration.
For many applications, a value of 0.5 has been proven to be just fine.
In this example, we are inserting values into a table containing two columns. For the
sake of this example, we want to assume that both columns are indexed.
Remember what you learned before: the purpose of the XLOG is to keep those data
files safe. So this operation will trigger a series of XLOG entries. First, the data file (or
files) related to the table will be written. Then the entries related to the indexes will
be created. Finally, a COMMIT record will be sent to the log.
Not all the XLOG records are equal. Various types of XLOG records exist, for
example heap, B-tree, clog, storage, Generalized Inverted Index (GIN), and
standby records, to name a few.
XLOG records are chained backward, so each entry points to the previous entry in
the file. In this way, we can be perfectly sure that we have found the end of a record
as soon as we have found the pointer to the previous entry.
The random() function has to produce a different output every time it is called, and
therefore we cannot just put the SQL into the log because it is not guaranteed to
provide us with the same outcome if it is executed during replay.
First of all, each XLOG record contains a CRC32 checksum. This allows us to check
the integrity of the log at startup. It is perfectly possible that the last write operations
before a crash were not totally sound any more, and therefore a checksum can
definitely help sort problems straightaway. The checksum is automatically computed
by PostgreSQL and users don't have to take care of this feature explicitly.
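The principle can be illustrated with a toy example (this is not PostgreSQL's actual record layout, just the idea of storing a CRC32 alongside each record and verifying it during replay):

```python
import zlib

def make_record(payload):
    # store a CRC32 checksum alongside the payload
    return (zlib.crc32(payload) & 0xffffffff, payload)

def is_valid(record):
    # recompute the checksum during replay and compare
    checksum, payload = record
    return (zlib.crc32(payload) & 0xffffffff) == checksum

record = make_record(b'heap insert: rel 1663/16384/16394')
print(is_valid(record))                          # True

corrupted = (record[0], record[1][:-1] + b'X')   # simulate a torn write
print(is_valid(corrupted))                       # False
```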
Finally, PostgreSQL uses a fixed-size XLOG. The size of the XLOG is determined
by checkpoint_segments as well as by checkpoint_completion_target.
(2 + checkpoint_completion_target) * checkpoint_segments + 1
An important thing to note is that if something is of fixed size, it can rarely run
out of space.
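Plugging in the old default values (checkpoint_segments = 3 and checkpoint_completion_target = 0.5) gives a feel for the numbers:

```python
# Worked example of the XLOG size bound quoted in the text, using the
# old default settings.
checkpoint_segments = 3
checkpoint_completion_target = 0.5
segment_size_mb = 16

max_segments = (2 + checkpoint_completion_target) * checkpoint_segments + 1
print(max_segments)                    # 8.5 segments
print(max_segments * segment_size_mb)  # 136.0 MB of XLOG at most
```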
You will learn more about this topic in the next chapter.
In addition to this, we have to set client_min_messages to make sure that the LOG
messages will reach our client.
Only if PostgreSQL has been compiled properly will we see some information about
the XLOG on the screen:
test=# INSERT INTO t_test VALUES (1, 'hans');
LOG:  INSERT @ 0/17C4680: prev 0/17C4620; xid 1009; len 36: Heap - insert(init): rel 1663/16384/16394; tid 0/1
LOG:  INSERT @ 0/17C46C8: prev 0/17C4680; xid 1009; len 20: Btree - newroot: rel 1663/16384/16397; root 1 lev 0
LOG:  INSERT @ 0/17C4700: prev 0/17C46C8; xid 1009; len 34: Btree - insert: rel 1663/16384/16397; tid 1/1
LOG:  INSERT @ 0/17C4748: prev 0/17C4700; xid 1009; len 20: Btree - newroot: rel 1663/16384/16398; root 1 lev 0
LOG:  INSERT @ 0/17C4780: prev 0/17C4748; xid 1009; len 34: Btree - insert: rel 1663/16384/16398; tid 1/1
LOG:  INSERT @ 0/17C47C8: prev 0/17C4780; xid 1009; len 12: Transaction - commit: 2013-02-25 18:20:46.633599+01
LOG:  XLOG flush request 0/17C47F8; write 0/0; flush 0/0
Just as stated in this chapter, PostgreSQL will first add a row to the table itself (heap).
Then the XLOG will contain all entries that are index-related. Finally, a commit
record will be added.
In all, 156 bytes have made it to the XLOG. This is far more than the data we
actually added. Consistency, performance (indexes), and reliability come at a price.
The following two sections describe those two types of replication slots in detail.
Of course, a replication slot can also be dangerous. What if a slave disconnects and
does not come back within a reasonable amount of time to consume all of the XLOG
kept in stock? In this case, the consequences for the master are not good: the retained
XLOG keeps piling up, and by the time the master's disk fills up, it might be too late.
To use replication slots in your setup, you have to tell PostgreSQL via
postgresql.conf to allow this feature; otherwise, the system will error out immediately:
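The setting in question is max_replication_slots; the exact value shown here is an arbitrary example (the original snippet is not reproduced in this copy):

```
max_replication_slots = 10    # (change requires restart)
```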
Before the database is restarted, make sure that wal_level is set to at least archive
to make this work. Once the database has been restarted, the replication slot can be
created:
test=# SELECT * FROM pg_create_physical_replication_slot('my_slot');
slot_name | xlog_position
-----------+---------------
my_slot |
(1 row)
In this scenario, a replication slot called my_slot has been created. So far, no XLOG
position has been assigned to it yet.
In the next step, it is possible to check which replication slots already exist. In this
example, there is only one, of course. To retrieve the entire list, just select from
pg_replication_slots:
test=# \d pg_replication_slots
View "pg_catalog.pg_replication_slots"
Column | Type | Modifiers
--------------+---------+-----------
slot_name | name |
plugin | name |
slot_type | text |
datoid | oid |
database | name |
active | boolean |
xmin | xid |
catalog_xmin | xid |
restart_lsn | pg_lsn |
The important thing here is that a physical replication slot has been created. It can
be used to stream XLOG directly, so it is really about physical copies of the data.
If a replication slot is not needed any more, it can be deleted. To do so, the following
instruction can be helpful:
test=# SELECT pg_drop_replication_slot('my_slot');
pg_drop_replication_slot
--------------------------
(1 row)
It is important to clean out replication slots as soon as they are not needed any more.
Otherwise, the server producing the XLOG might fill up and face trouble because of
a full filesystem. It is highly important to keep an eye on those replication slots and
make sure that the cleanup of spare slots really happens.
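A query along these lines (my own sketch, not from the book) helps spot slots that are currently inactive and therefore prime candidates for cleanup:

```
test=# SELECT slot_name, active, restart_lsn
       FROM pg_replication_slots
       WHERE NOT active;
```

Any slot that stays inactive for a long time while holding back an old restart_lsn deserves a closer look.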
The following example shows how a simple plugin, called test_decoding, can be
utilized to dissect the XLOG:
test=# SELECT * FROM pg_create_logical_replication_slot('slot_name',
'test_decoding');
slot_name | xlog_position
-----------+---------------
slot_name | D/438DCEB0
(1 row)
As soon as the replication slot is created, the transaction log position is returned.
This is important to know because from this position onward, the decoded XLOG
will be sent to the client.
The replication slot can now be used. Logically, at this point, it won't return anything
because nothing has happened in the database system so far:
test=# SELECT * FROM pg_logical_slot_get_changes('slot_name', NULL,
NULL);
(No rows)
It is now possible to generate a table and insert some data to make use of the
replication slot:
test=# CREATE TABLE t_test (id serial, name text, PRIMARY KEY (id));
CREATE TABLE
For the sake of simplicity, a table consisting of only two columns is enough. In the
next step, two rows are added:
test=# INSERT INTO t_test (name) VALUES ('hans'), ('paul') RETURNING *;
id | name
----+------
1 | hans
2 | paul
(2 rows)
A couple of things need to be taken care of here. One of them is that DDLs (in our
case, CREATE TABLE) are not decoded and cannot be replicated through replication
slots. Therefore, transaction 937 is just an empty thing. The more important details
can be seen in transaction 938. As the data has been added in one transaction, it is
also logically decoded in a single transaction.
The important thing is that data is only decoded once. If the function is called again,
it won't return anything in this case:
test=# SELECT * FROM pg_logical_slot_get_changes('slot_name', NULL,
NULL);
location | xid | data
----------+-----+------
(0 rows)
The system will add a timestamp to the end of the commit record of the row.
The new and the old keys are represented. However, somebody might want to
see more. To extract more information, ALTER TABLE can be called. The REPLICA
IDENTITY parameter can be set to FULL:
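For the t_test table used throughout this chapter, the command would look as follows (the exact snippet is not reproduced in this copy, so this is a reconstruction from context):

```
test=# ALTER TABLE t_test REPLICA IDENTITY FULL;
ALTER TABLE
```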
The UPDATE command will now create a more verbose representation of the changed
row. The content is far more detailed, which makes life a bit easier for many
client applications:
test=# SELECT * FROM pg_logical_slot_get_changes('slot_name', NULL,
NULL);
 location   | xid | data
------------+-----+------------------------------------------------------
 D/4390B818 | 944 | BEGIN 944
 D/4390B818 | 944 | table public.t_test: UPDATE: old-key: id[integer]:1 name[text]:'hans' new-tuple: id[integer]:10 name[text]:'hans'
 D/4390B908 | 944 | COMMIT 944
(3 rows)
The most important difference is that in the case of the name column, both versions
(before and after) are included in the textual representation.
• DEFAULT: This records the old values of the columns of the primary key,
if any.
• USING INDEX: This records the old values of the columns covered by
the named index, which must be unique, not partial, and not deferrable, and
must include only columns marked as NOT NULL.
• FULL: This records the old values of all the columns in the row.
• NOTHING: This records no information about the old row.
The default setting is usually fine. However, in some situations (especially auditing),
a bit more information is definitely worth creating.
Summary
In this chapter, you learned about the purpose of the PostgreSQL transaction log.
We talked extensively about the on-disk data format and some highly important
topics, such as consistency and performance. All of them will be needed when we
replicate our first database in the next chapter.
The next chapter will build on top of what you have just learned and focus on
Point-in-time Recovery (PITR). The goal will be to make PostgreSQL
return to a certain point in time and provide data as if some later transactions
had never happened.
Understanding Point-in-time Recovery
So far, you have endured a fair amount of theory. However, since life does not
only consist of theory (as important as it may be), it is definitely time to dig
into practical stuff.
The goal of this chapter is to make you understand how you can recover your
database to a given point in time. When your system crashes, or when somebody
just happens to drop a table accidentally, it is highly important not to replay the
entire transaction log but just a fraction of it. Point-in-time Recovery will be the tool
to do this kind of partial replay of the transaction log.
In this chapter, you will learn all you need to know about Point-in-time Recovery
(PITR), and you will be guided through practical examples. Therefore, we will
apply all the concepts you have learned in Chapter 2, Understanding the PostgreSQL
Transaction Log, to create some sort of incremental backup or set up a simple,
rudimentary standby system.
At the end of this chapter, you should be able to set up Point-in-time Recovery easily.
A dump is always consistent. This means that all the foreign keys are
intact. New data added after starting the dump will be missing. It is
the most common way to perform standard backups.
But what if your data is so valuable and, maybe, so large in size that you want to
back it up incrementally? Taking a snapshot from time to time might be enough for
some applications, but for highly critical data, it is clearly not. In addition to that,
replaying 20 TB of data in textual form is not efficient either. Point-in-time Recovery
has been designed to address this problem. How does it work? Based on a snapshot
of the database, the XLOG will be replayed later on. This can happen indefinitely or
up to a point chosen by you. In this way, you can reach any point in time.
This method opens the door to many different approaches and features:
Chapter 3
The beauty of the design is that you can basically use an arbitrary shell script to
archive the transaction log. Here are some ideas:
The possible options for managing XLOG are only limited by the imagination.
Keep in mind that when archive_command fails for some reason, PostgreSQL will
keep the XLOG file and retry after a couple of seconds. If archiving fails constantly
from a certain point onward, the master might fill up. Also, the sequence of XLOG
files must not be interrupted: PostgreSQL needs an uninterrupted sequence to
replay, so if even a single file is missing, the recovery process will stop at that point.
The first thing you have to do when it comes to Point-in-time Recovery is archive the
XLOG. PostgreSQL offers all the configuration options related to archiving through
postgresql.conf.
Let us see step by step what has to be done in postgresql.conf to start archiving:
Let's set up archiving now. To do so, we should create a place to put the XLOG.
Ideally, the XLOG is not stored on the same hardware as the database instance you
want to archive. For the sake of this example, we assume that we want to copy an
archive to /archive. The following changes have to be made to postgresql.conf:
wal_level = archive
# minimal, archive, hot_standby, or logical
# (change requires restart)
archive_mode = on
# allows archiving to be done
# (change requires restart)
archive_command = 'cp %p /archive/%f'
# command to use to archive a logfile segment
# placeholders: %p = path of file to archive
# %f = file name only
Once these changes have been made to postgresql.conf, archiving is ready for action.
To activate these changes, the database server must be restarted.
• minimal
• archive
• hot_standby
• logical
To replay the transaction log, at least archive is needed. The difference between
archive and hot_standby is that archive does not have to know about the
currently running transactions. However, for streaming replication, as will be
covered in the next chapters, this information is vital.
If you are planning to use logical decoding, wal_level must be set to logical
to make sure that the XLOG contains even more information, which is needed
to support logical decoding. Logical decoding requires the most verbose XLOG
currently available in PostgreSQL.
The easiest way to check whether our archiving works is to create some useless
data inside the database. The following snippet shows that a million rows can be
made easily:
test=# CREATE TABLE t_test AS SELECT * FROM generate_series(1, 1000000);
SELECT 1000000
We have simply created a list of numbers. The important thing is that one million
rows will trigger a fair amount of XLOG traffic. You will see that a handful of files
have made it to the archive:
iMac:archive hs$ ls -l /archive/
total 131072
-rw------- 1 hs wheel 16777216 Mar 5 22:31 000000010000000000000001
-rw------- 1 hs wheel 16777216 Mar 5 22:31 000000010000000000000002
-rw------- 1 hs wheel 16777216 Mar 5 22:31 000000010000000000000003
-rw------- 1 hs wheel 16777216 Mar 5 22:31 000000010000000000000004
If you want to save storage, you can also compress those XLOG files.
Just add gzip to your archive_command. This complicates things a
little, but the savings can certainly be worth it.
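One possible variation (an illustration, not the book's exact snippet; adapt paths as needed) compresses each segment on the fly:

```
archive_command = 'gzip < %p > /archive/%f.gz'
```

Keep in mind that restore_command then has to reverse the compression, for example with gunzip < /archive/%f.gz > %p.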
Keep in mind that the XLOG itself is more or less worthless. It is only
useful in combination with the initial base backup.
In PostgreSQL, there are two main options to create an initial base backup:
• Using pg_basebackup
• Traditional methods based on copy/rsync
Note that pg_dump cannot be used for a base backup, because a binary copy of the
data is required. pg_dump provides a textual representation of the data, not a binary
one, so it is not feasible here.
The following two sections will explain in detail how a base backup can be created.
Using pg_basebackup
The first and most common method of creating a backup of an existing server
is by running a command called pg_basebackup, which was first introduced in
PostgreSQL 9.1.0. Previously, other techniques that are described later in this book
were available. Basically, pg_basebackup is able to fetch a database base backup
directly over a database connection. When executed on the slave, pg_basebackup
will connect to the database server of your choice, and copy all the data files in the
data directory to your machine. There is no need to log in to the box anymore, and
all it takes is one line of code to run it; pg_basebackup will do all the rest for you.
In this example, we will assume that we want to take a base backup of a host called
sample.postgresql-support.de. The following steps must be performed:
Modifying pg_hba.conf
To allow remote boxes to log in to a PostgreSQL server and stream XLOG, you have
to explicitly allow replication.
In PostgreSQL, there is a file called pg_hba.conf that tells the server which boxes
are allowed to connect using which type of credentials. Entire IP ranges can be
allowed or simply discarded through pg_hba.conf.
To enable replication, we have to add one line for each IP range we want to allow.
The following listing contains an example of a valid configuration:
# TYPE DATABASE USER ADDRESS METHOD
host replication all 192.168.0.34/32 md5
Once we have told the server acting as data source to accept streaming connections,
we can move forward and run pg_basebackup.
Usage:
pg_basebackup [OPTION]...
-R, --write-recovery-conf
write recovery.conf after backup
-T, --tablespace-mapping=OLDDIR=NEWDIR
relocate tablespace in OLDDIR to NEWDIR
-x, --xlog include required WAL files in backup (fetch
mode)
-X, --xlog-method=fetch|stream
include required WAL files with specified method
--xlogdir=XLOGDIR location for the transaction log directory
-z, --gzip compress tar output
-Z, --compress=0-9 compress tar output with given compression level
General options:
-c, --checkpoint=fast|spread
set fast or spread checkpointing
-l, --label=LABEL set backup label
-P, --progress show progress information
-v, --verbose output verbose messages
-V, --version output version information, then exit
-?, --help show this help, then exit
Connection options:
-d, --dbname=CONNSTR connection string
-h, --host=HOSTNAME database server host or socket directory
-p, --port=PORT database server port number
-s, --status-interval=INTERVAL
time between status packets sent to server (in
seconds)
-U, --username=NAME connect as specified database user
-w, --no-password never prompt for password
-W, --password force password prompt (should happen
automatically)
When we create a base backup as shown in this section, pg_basebackup will connect
to the server and wait for a checkpoint to happen before the actual copy process is
started. In this mode, this is necessary because the replay process will start exactly at
this point in the XLOG. The problem is that it might take a while until a checkpoint
kicks in; pg_basebackup does not enforce a checkpoint on the source server
straightaway so as to make sure that normal operations are not disturbed.
By default, a plain base backup will be created. It will consist of all the files in the
directories found on the source server. If the base backup has to be stored on tape,
we recommend that you give --format=t a try. It will automatically create a tar
archive (maybe on a tape). If you want to move data to tape, you can easily save an
intermediate step this way. When using tar, it is usually quite beneficial to combine
it with --gzip to reduce the size of the base backup on disk.
There is also a way to see a progress bar while performing the base
backup, but we don't recommend using this option (--progress)
because it requires pg_basebackup to determine the size of the
source instance first, which can be costly.
Backup throttling
If your master has only a weak I/O subsystem, it can happen that a full-blown base
backup creates so much load on the master's I/O system that normal transactions
begin to suffer. Performance goes down and latency goes up. This is clearly
undesirable. Therefore, pg_basebackup allows backup speeds to be reduced.
By reducing the throughput of the backup tool, the master will have more capacity
left to serve normal requests.
To reduce the speed of the backup, you can use the -r (--max-rate) command-line
switch. The expected unit is kilobytes per second, but a megabyte suffix is also
accepted. Just pass the desired throughput as a parameter and stretch out the backup
to make sure that your master server does not suffer from it. The optimal value
highly depends on your hardware infrastructure and performance requirements,
so administrators have to determine the right speed as needed.
But what if we want to create a base backup that can live without (explicitly archived)
XLOG? In such a case, we can use the --xlog-method=stream option. If this option
is chosen, pg_basebackup will not only copy the data as it is, but also stream the
XLOG created during the base backup to our destination server. This provides us
with just enough XLOG to start the base backup directly. It is self-sufficient and does
not need additional XLOG files. This is not Point-in-time Recovery, but it can come
in handy if there is trouble. Having a base backup that can be started right away is
usually a good thing, and it comes at a fairly low cost.
If you are planning to use Point-in-time Recovery and if there is absolutely no need
to start the backup as it is, you can safely skip the XLOG and save some space this
way (default mode).
The main advantage of this old method is that there is no need to open a database
connection, and no need to configure the XLOG streaming infrastructure on the
source server.
Another main advantage is that you can make use of features such as ZFS snapshots
or similar means, which can dramatically reduce the amount of I/O needed to
create an initial backup.
Tablespace issues
If you happen to use more than one tablespace, pg_basebackup will handle this
just fine, provided the filesystem layout on the target box is identical to the
filesystem layout on the master. However, if your target system does not use the
same filesystem layout, there will be a bit more work to do. Using the traditional
way of doing the base backup might be beneficial in this case.
If you are using --format=t (for tar), you will be provided with one tar file
per tablespace.
If you want to limit rsync to, say, 20 MB/sec, you can simply use
rsync --bwlimit=20000. This will definitely cause creation of
the base backup to take longer, but it will make sure that your client
apps don't face problems. In general, we recommend a dedicated
network which connects the master and slave to make sure that a
base backup does not affect normal operations.
Of course, you can use any other tools to copy data and achieve similar results.
To get you started, we have decided to include a simple recovery.conf sample file
for performing a basic recovery process:
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2015-10-10 13:43:12'
To tell the system when to stop recovery, we can set recovery_target_time. This
variable is actually optional. If it has not been specified, PostgreSQL will recover
until it runs out of XLOG. In many cases, simply consuming the entire XLOG is a
highly desirable process. If something crashes, you would want to restore as much
data as possible, but this is not always so. If you want to make PostgreSQL stop
recovery at a specific point in time, you simply have to put in the proper date. The
crucial part here is actually knowing how far you want to replay the XLOG. In a
real-world scenario, this has proven to be the trickiest question to answer.
Before starting PostgreSQL, you have to run chmod 700 on the directory containing
the base backup, otherwise PostgreSQL will error out:
iMac:target_directory hs$ pg_ctl -D /target_directory \
start
server starting
FATAL: data directory "/target_directory" has group or world access
DETAIL: Permissions should be u=rwx (0700).
This additional security check is supposed to make sure that your data directory
cannot be read by some user accidentally. Therefore, an explicit permission change
is definitely an advantage from a security point of view (better safe than sorry).
Now that we have all the pieces in place, we can start the replay process by
starting PostgreSQL:
iMac:target_directory hs$ pg_ctl -D /target_directory \
start
server starting
LOG: database system was interrupted; last known up at 2015-03-10
18:04:29 CET
LOG: creating missing WAL directory "pg_xlog/archive_status"
The first line indicates that PostgreSQL has realized that it has been interrupted
and has to start recovery. From the database instance point of view, a base backup
looks more or less like a crash needing some instant care by replaying XLOG.
This is precisely what we want.
The next couple of lines (restored log file ...) indicate that we are replaying
one XLOG file after the other. These files have been created since the base backup.
It is worth mentioning that the replay process starts at the sixth file. The base backup
knows where to start, so PostgreSQL will automatically look for the right XLOG file.
The message displayed after PostgreSQL reaches the sixth file (consistent
recovery state reached at 0/60000B8) is important. PostgreSQL states that it
has reached a consistent state. The reason is that the data files inside a base backup
are actually broken by definition (refer to the section The XLOG and replication in
Chapter 2, Understanding the PostgreSQL Transaction Log), but the data files are not
broken beyond repair. As long as we have enough XLOG to recover, we are very
well off. If you cannot reach a consistent state, your database instance will not be
usable and your recovery cannot work without providing additional XLOG.
Once we have reached a consistent state, one file after another will be replayed
successfully until the system finally looks for the 00000001000000000000000B file.
The problem is that this file has not been created by the source database instance.
Logically, an error pops up.
Not finding the last file is absolutely normal. This type of error is
expected if recovery_target_time does not ask PostgreSQL to
stop recovery before it reaches the end of the XLOG stream. Don't
worry! Your system is actually fine. The error just indicates that you
have run out of XLOG. You have successfully replayed everything to
the file that shows up exactly before the error message.
As soon as all of the XLOG has been consumed and the error message discussed
earlier has been issued, PostgreSQL reports the last transaction that it was able to or
supposed to replay and starts up. You now have a fully recovered database instance
and can connect to the database instantly. As soon as the recovery has ended,
recovery.conf will be renamed by PostgreSQL to recovery.done to make sure that
it does not do any harm when the new instance is restarted later on at some point.
The recovery.conf file has all you need. If you want to replay up to a certain
transaction, you can refer to recovery_target_xid. Just specify the transaction you
need, and use recovery_target_inclusive to control whether that very transaction
is included in the replay or not. Using this setting is technically easy, but as
mentioned before, it can be really difficult to find a specific transaction ID.
In a typical setup, the best way to find a reasonable point to stop recovery is to use
pause_at_recovery_target. If this is set to true, PostgreSQL will not automatically
turn into a production instance once the recovery target has been reached. Instead, it
will wait for further instructions from the database administrator. This is especially
useful if you don't know exactly how far to replay. You can replay, log in, see how
far the database is, change to the next target time, and continue replaying in
small steps.
Resuming recovery after PostgreSQL has paused can be done by calling a simple
SQL statement: SELECT pg_xlog_replay_resume(). This will make the instance
move on to the next position you have set in recovery.conf. Once you have found
the right place, you can set pause_at_recovery_target back to false and call
pg_xlog_replay_resume. Alternatively, you can simply use pg_ctl -D ... promote
to stop the recovery and make the instance operational.
Was this explanation too complicated? Let us boil it down to a simple list:
1. Change recovery_target_time
2. Run SELECT pg_xlog_replay_resume()
3. Check again and repeat this section if necessary
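Put together, a recovery.conf for this stepwise approach might look like the following sketch (the timestamp is a placeholder; use your own target):

```
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2015-10-10 13:43:12'
pause_at_recovery_target = true
```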
Keep in mind that once the recovery has completed and once
PostgreSQL has started up as a normal database instance, there is
(as of 9.4) no way to replay XLOG later on. Instead of going through
this process, you can—of course—always use filesystem snapshots.
A filesystem snapshot will always work with PostgreSQL because
when you restart a "snapshotted" database instance, it will simply
believe that it had crashed before and recover normally.
Keep in mind, however, that you must keep enough XLOG so that you can always
perform recovery from the latest base backup. But if you are certain that a specific
base backup is not needed anymore, you can safely clean out all of the XLOG that
is older than the base backup you want to keep.
How can an administrator figure out what to delete? The best method is to simply
take a look at their archive directory:
000000010000000000000005
000000010000000000000006
000000010000000000000006.00000020.backup
000000010000000000000007
000000010000000000000008
Check out the filename in the middle of the listing. The .backup file has been created
by the base backup. It contains some information about how the base backup was
made, and tells the system where to continue replaying the XLOG. If the .backup
file belongs to the oldest base backup you need to keep around, you can safely erase
all XLOG files with numbers lower than 6. In this case, file number 5 can be
safely deleted.
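The cleanup rule can be sketched in a few lines of Python (my own illustration, not from the book): anything whose 24-hex-digit name sorts below the segment named in the oldest .backup marker can go.

```python
# Sketch: given the archive listing and the ".backup" marker of the oldest
# base backup to keep, list the XLOG files that can be deleted safely.
# XLOG file names consist of 24 hex digits (timeline, log ID, segment),
# so plain string comparison matches chronological order.

def removable_xlog(archive_files, backup_marker):
    keep_from = backup_marker.split('.')[0]   # segment named in the marker
    return sorted(f for f in archive_files
                  if len(f) == 24 and f < keep_from)

files = ["000000010000000000000005",
         "000000010000000000000006",
         "000000010000000000000006.00000020.backup",
         "000000010000000000000007",
         "000000010000000000000008"]

print(removable_xlog(files, "000000010000000000000006.00000020.backup"))
```

For the listing shown above, only file number 5 is reported as removable, which matches the manual reasoning in the text.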
The .backup file will also provide you with relevant information, such as the
time the base backup has been made. It is plainly shown there, and so it should
be easy for ordinary users to read this information.
However, if your system crashes, the potential data loss corresponds to the amount
of data in your last, unfinished XLOG file. Maybe this is not good enough for you.
It is also possible to make PostgreSQL switch to the next XLOG file manually.
A function named pg_switch_xlog() is provided by the server to do this job:
test=# SELECT pg_switch_xlog();
pg_switch_xlog
----------------
0/17C0EF8
(1 row)
You might want to call this function when some important patch job has finished, or
if you want to make sure that a certain chunk of data is safely in your XLOG archive.
Summary
In this chapter, you learned about Point-in-time Recovery, which is a safe and easy
way to restore your PostgreSQL database to any desired point in time. PITR will
help you implement better backup policies and make your setups more robust.
In the next chapter, we will extend this topic and turn to asynchronous replication.
You will learn how to replicate an entire database instance using the PostgreSQL
transaction log.
Setting Up Asynchronous Replication
After performing our first PITR, we are ready to work on a real replication setup.
In this chapter, you will learn how to set up asynchronous replication and streaming.
The goal is to make sure that you can achieve higher availability and higher
data security.
At the end of this chapter, you will be able to easily set up streaming replication in
a couple of minutes.
Missing the last XLOG file, which has not been finalized (and thus
not sent to the archive and lost because of the crash), is often the core
reason that people report data loss in the case of PITR.
In this scenario, streaming replication will be the solution to your problem. With
streaming replication, the replication delay will be minimal and you can enjoy an
extra level of protection for your data.
Let's talk about the general architecture of the PostgreSQL streaming infrastructure.
The following diagram illustrates the basic system design:
You have already seen this type of architecture. What we have added here is the
streaming connection. It is basically a normal database connection, just as you would
use in any other application. The only difference is that in the case of a streaming
connection, the connection will be in a special mode so as to be able to carry the XLOG.
Chapter 4
Now that the master knows that it is supposed to produce enough XLOG, handle
XLOG senders, and so on, we can move on to the next step.
For security reasons, you must configure the master to enable streaming replication
connections. This requires changing pg_hba.conf as shown in the previous
chapter. Again, this is needed to run pg_basebackup and the subsequent streaming
connection. Even if you are using a traditional method to take the base backup, you
still have to allow replication connections to stream the XLOG, so this step
is mandatory.
Once your master has been successfully configured, you can restart the database (to
make wal_level and max_wal_senders work) and continue working on the slave.
To fetch the base backup, we can use pg_basebackup, just as was shown in the
previous chapter. Here is an example:
iMac:db hs$ pg_basebackup -D /target_directory \
-h sample.postgresql-support.de \
--xlog-method=stream
Now that we have taken a base backup, we can move ahead and configure
streaming. To do so, we have to write a file called recovery.conf (just like before).
Here is a simple example:
standby_mode = on
primary_conninfo = 'host=sample.postgresql-support.de port=5432'
Note that from PostgreSQL 9.3 onwards, there is a -R flag for pg_basebackup,
which is capable of automatically generating recovery.conf. In other words,
a new slave can be generated using just one command.
• standby_mode: This setting will make sure that PostgreSQL does not stop
once it runs out of XLOG. Instead, it will wait for new XLOG to arrive. This
setting is essential in order to make the second server a standby, which
replays XLOG constantly.
• primary_conninfo: This setting will tell our slave where to find the master.
You have to put a standard PostgreSQL connect string (just like in libpq)
here. The primary_conninfo variable is central and tells PostgreSQL to
stream XLOG.
For a basic setup, these two settings are totally sufficient. All we have to do now is to
fire up the slave, just like starting a normal database instance:
iMac:slavehs$ pg_ctl -D /target_directory start
server starting
LOG: database system was interrupted; last known up
at 2015-03-17 21:08:39 CET
LOG: creating missing WAL directory
"pg_xlog/archive_status"
LOG: entering standby mode
LOG: streaming replication successfully connected
to primary
LOG: redo starts at 0/2000020
LOG: consistent recovery state reached at 0/3000000
The database instance has successfully started. It detects that normal operations have
been interrupted. Then it enters standby mode and starts to stream XLOG from the
primary system. PostgreSQL then reaches a consistent state and the system is ready
for action.
This is the default configuration. The slave instance stays in recovery mode
and keeps replaying XLOG.
If you want to make the slave readable, you have to adapt postgresql.conf on the
slave system; hot_standby must be set to on. You can set this straightaway, but you
can also make this change later on and simply restart the slave instance when you
want this feature to be enabled:
The restart will shut down the server and fire it back up again. This is not too much
of a surprise; however, it is worth taking a look at the log. You can see that a process
called walreceiver is terminated.
Once we are back up and running, we can connect to the server. Logically, we are
only allowed to perform read-only operations:
test=# CREATE TABLE x (id int4);
ERROR: cannot execute CREATE TABLE in a read-only transaction
The server will not accept writes, as expected. Remember, slaves are read-only.
Behind the scenes, two kinds of processes handle the streaming connection:
• wal_sender
• wal_receiver
The wal_sender instances are processes on the master instance that serve XLOG
to their counterpart on the slave, called wal_receiver. Each slave has exactly one
wal_receiver process, and this process is connected to exactly one wal_sender
process on the data source.
How does this entire thing work internally? As we have stated before, the connection
from the slave to the master is basically a normal database connection. The
transaction log is shipped using more or less the same method as the COPY command.
Inside the COPY mode, PostgreSQL uses a little micro language to ship information
back and forth. The main advantage is that this little language has its own parser,
and so it is possible to add functionality fast and in a fairly easy, non-intrusive way.
As of PostgreSQL 9.4, a handful of protocol commands are supported, and the
protocol level is pretty close to what pg_basebackup offers as
command-line flags.
In many cases, however, the situation is a bit more delicate. Let's assume for this
example that we want to use a master to spread data to dozens of servers. The
overhead of replication is actually very small (common wisdom says that the
overhead of a slave is around 3 percent of overall performance—however, this is just
a rough estimate), but if you do something small often enough, it can still be an issue.
It is definitely not very beneficial for the master to have, say, 100 slaves.
An additional use case is as follows: having a master in one location and a couple of
slaves in some other location. It does not make sense to send a lot of data over a long
distance over and over again. It is a lot better to send it once and dispatch it to the
other side.
To make sure that not all servers have to consume the transaction log from a single
master, you can make use of cascaded replication. Cascading means that a master
can stream its transaction log to a slave, which will then serve as the dispatcher and
stream the transaction log to further slaves.
The slaves to the far right of the diagram could serve as dispatchers again. With this
very simple method, you can basically create a system of infinite size.
The procedure to set things up is basically the same as that for setting up a single
slave. You can easily take base backups from an operational slave (postgresql.conf
and pg_hba.conf have to be configured just as in the case of a single master).
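On a second-level slave, recovery.conf simply points at the intermediate slave instead of the master. The host name below is hypothetical:

```
standby_mode = on
primary_conninfo = 'host=slave1.postgresql-support.de port=5432'
```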
PostgreSQL offers some simple ways to turn a slave into a master. The first, and
most likely the most convenient, way is to use pg_ctl:
iMac:slavehs$ pg_ctl -D /target_directory promote
server promoting
iMac:slavehs$ psql test
psql (9.2.4)
Type "help" for help.
test=# CREATE TABLE sample (id int4);
CREATE TABLE
The promote command will signal the postmaster and turn your slave into a master.
Once this is complete, you can connect and create objects.
If you've got more than one slave, make sure that those slaves are
manually repointed to the new master before the promotion.
To use the file-based method, you can add the trigger_file setting to your
recovery.conf file:
trigger_file = '/some_path/start_me_up.txt'
PostgreSQL will check for the file you have defined in recovery.conf every 5
seconds. For most cases, this is perfectly fine and more than fast enough.
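Promotion via the file-based method then boils down to creating that file. A minimal sketch, using a hypothetical /tmp path in place of the path above:

```shell
# Hypothetical trigger path; use whatever trigger_file points to.
TRIGGER_FILE=/tmp/start_me_up.txt

# Creating the file is all it takes. The standby notices it within
# its polling interval (roughly 5 seconds) and promotes itself.
touch "$TRIGGER_FILE"
ls -l "$TRIGGER_FILE"
```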
The good news is that PostgreSQL allows you to actually mix file-based and
streaming-based replication. You don't have to decide whether streaming-based or
file-based is better. You can have the best of both worlds at the very same time.
How can you do that? In fact, you have already seen all the ingredients; we just have
to put them together in the right way.
In our case, we have simply opened an entire network to allow replication (to keep
the example simple).
Once we have made these changes, we can restart the master and take a base backup
as shown earlier in this chapter.
In the next step, we can write a simple recovery.conf file and put it into the main
data directory:
restore_command = 'cp /archive/%f %p'
standby_mode = on
primary_conninfo = 'host=sample.postgresql-support.de port=5432'
trigger_file = '/some_path/start_me_up.txt'
When this slave instance starts up, the following happens:
1. PostgreSQL will call restore_command to fetch the transaction log from the
archive.
2. It will do so until no more files can be found in the archive.
3. PostgreSQL will try to establish a streaming connection.
4. It will stream if data exists.
You can keep streaming as long as necessary. If you want to turn the slave into a
master, you can again use pg_ctl promote or the trigger_file file defined in
recovery.conf.
Error scenarios
The most important advantage of a dual strategy is that you can create a cluster that
offers a higher level of security than just plain streaming-based or plain file-based
replay. If streaming does not work for some reason, you can always fall back on
the files.
In this section, we will discuss some typical error scenarios in a dual strategy cluster.
If the streaming connection fails, PostgreSQL will try to keep syncing itself through
the file-based channel. Should the file-based channel also fail, the slave will sit there
and wait for the network connection to come back. It will then try to fetch XLOG and
simply continue once this is possible again.
1. In the first scenario, the slave is streaming. If the stream is okay and intact,
the slave will not notice that some XLOG file somehow became corrupted in
the archive. The slaves never need to read from the XLOG files as long as
the streaming connection is operational.
2. If we are not streaming but replaying from a file, PostgreSQL will inspect
every XLOG record and see whether its checksum is correct. If anything
goes wrong, the slave will not continue to replay the corrupted XLOG. This
will ensure that no problems can propagate and no broken XLOG can be
replayed. Your database might not be complete, but it will be sane and
consistent up to the point of the erroneous XLOG file.
Surely, there is a lot more that can go wrong, but given those likely cases, you can
see clearly that the design has been made as reliable as possible.
In many real-world scenarios, two ways of transporting the XLOG might be too
complicated. In many cases, it is enough to have just streaming. The point is that in
a normal setup, as described already, the master can throw the XLOG away as soon
as it is not needed to repair the master anymore. Depending on your checkpoint
configuration, the XLOG might be around for quite a while or only a short time. The
trouble is that if your slave connects to the master, it might happen that the desired
XLOG is not around anymore. The slave cannot resync itself in this scenario. You
might find this a little annoying, because it implicitly limits the maximum downtime
of your slave to your master's checkpoint behavior.
Using wal_keep_segments
To make your setup much more robust, we recommend making heavy use of wal_
keep_segments. The idea of this postgresql.conf setting (on the master) is to teach
the master to keep more XLOG files around than theoretically necessary. If you set
this variable to 1000, it essentially means that the master will keep 16 GB more
XLOG (1,000 segments of 16 MB each) than needed. In other words, your slave can be gone for 16 GB (in terms of
changes to the master) longer than usual. This greatly increases the odds that a slave
can join the cluster without having to completely resync itself from scratch. For a 500
MB database, this is not worth mentioning, but if your setup has to hold hundreds of
gigabytes or terabytes, it becomes an enormous advantage. Producing a base backup
of a 20 TB instance is a lengthy process. You might not want to do this too often, and
you definitely won't want to do this over and over again.
What are the reasonable values for wal_keep_segments? As always, this largely
depends on your workloads. From experience, we can tell that a multi-GB implicit
archive on the master is definitely an investment worth considering. Very low values
for wal_keep_segments might be risky and not worth the effort. Nowadays, disk space is
usually cheap. Small systems might not need this setting, and large ones should have
sufficient spare capacity to absorb the extra requirements. Personally, I am always in
favor of using at least some extra XLOG segments.
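The disk-space cost is easy to estimate, since every XLOG segment has a fixed size of 16 MB by default:

```shell
# Back-of-the-envelope sizing for wal_keep_segments.
WAL_KEEP_SEGMENTS=1000
SEGMENT_MB=16                      # default XLOG segment size
RETAINED_MB=$((WAL_KEEP_SEGMENTS * SEGMENT_MB))
echo "wal_keep_segments = $WAL_KEEP_SEGMENTS retains about $RETAINED_MB MB (~$((RETAINED_MB / 1000)) GB)"
```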
The question now is: how can a replication slot be used? Basically, it is very simple.
All that has to be done is create the replication slot on the master and tell the slave
which slots to use through recovery.conf.
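A minimal sketch of the master-side step, using the slot name that appears in this section's recovery.conf example (as of PostgreSQL 9.4, these functions handle physical slots):

```sql
-- On the master: create a physical replication slot. The name must
-- match primary_slot_name on the slave.
SELECT * FROM pg_create_physical_replication_slot('repl_slot');

-- List existing slots and whether a consumer is attached:
SELECT slot_name, active FROM pg_replication_slots;

-- When a slave is decommissioned, drop its slot so that the master
-- can recycle the transaction log again:
SELECT pg_drop_replication_slot('repl_slot');
```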
Once the base backup has happened, the slave can be configured easily:
standby_mode = 'on'
primary_conninfo = 'host=master.postgresql-support.de port=5432 user=hans password=hanspass'
primary_slot_name = 'repl_slot'
The configuration is just as if there were no replication slots. The only change is
that the primary_slot_name variable has been added. The slave will pass the name
of the replication slot to the master, and the master knows when to recycle the
transaction log. As mentioned already, if a slave is not in use anymore, make sure
that the replication slot is properly deleted to avoid trouble on the master (running
out of disk space and other troubles). This failure mode is insidious: slaves,
being optional, are often not monitored as closely as they should be. It is
therefore a good idea to regularly compare pg_stat_replication with
pg_replication_slots and investigate any mismatches.
In this section, you will learn what kind of settings there are and how you can
make use of those features easily.
What applies to Keynesian stimulus is equally true in the case of XLOG archiving;
you simply cannot keep doing it forever. At some point, the XLOG has to be
thrown away.
You can make PostgreSQL execute some cleanup routine (or anything else) as soon
as the restart point is reached. It is easily possible to clean the older XLOG or trigger
some notifications.
The following script shows how you can clean any XLOG that is older than a day:
#!/bin/sh
find /archive -type f -mtime +1 -exec rm -f {} \;
Keep in mind that your script can be of any kind of complexity. You have to decide
on a proper policy to handle the XLOG. Every business case is different, and you
have all the flexibility to control your archives and replication behavior.
Again, you can use this to clean the old XLOG, send out notifications, or perform any
other kind of desired action.
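One convenient way to hook into the restart point is the archive_cleanup_command setting in recovery.conf. A minimal sketch using the pg_archivecleanup utility shipped with PostgreSQL (the archive path matches the earlier examples; %r expands to the filename of the last restart point):

```
archive_cleanup_command = 'pg_archivecleanup /archive %r'
```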
Conflict management
In PostgreSQL, the streaming replication data flows in one direction only. The XLOG
is provided by the master to a handful of slaves, which consume the transaction log
and provide you with a nice copy of the data. You might wonder how this could ever
lead to conflicts. Well, there can be conflicts.
Consider the following scenario: as you know, data is replicated with a very small
delay. So, the XLOG ends up at the slave after it has been made on the master. This
tiny delay can cause the scenario shown in the following diagram:
Let's assume that a slave starts to read a table. It is a long read operation. In the
meantime, the master receives a request to actually drop the table. This is a bit of a
problem, because the slave will still need this data to perform its SELECT statement.
On the other hand, all the requests coming from the master have to be obeyed under
any circumstances. This is a classic conflict, and there are two ways to resolve it:
• Don't replay the conflicting transaction log before the slave has terminated
the operation in question
• Kill the query on the slave to resolve the problem
The first option might lead to ugly delays during the replay process, especially if
the slave performs fairly long operations. The second option might frequently kill
queries on the slave. The database instance cannot know by itself what is best for
your application, so you have to find a proper balance between delaying the replay
and killing queries.
Two postgresql.conf settings on the slave control this trade-off:
max_standby_archive_delay = 30s
# max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
max_standby_streaming_delay = 30s
# max delay before canceling queries
# when reading streaming WAL;
# -1 allows indefinite delay
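To see how often queries have actually been canceled on a slave, and why, the pg_stat_database_conflicts system view can be inspected (it is only populated on a hot standby):

```sql
SELECT datname, confl_snapshot, confl_lock,
       confl_tablespace, confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;
```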
In the previous example, we have shown that a conflict may show up if a table is
dropped. This is an obvious scenario; however, it is by no means the most common
one. It is much more likely that a row is removed by VACUUM or HOT-UPDATE
somewhere, causing conflicts on the slave.
Conflicts popping up once in a while can be really annoying and trigger bad
behavior of your applications. In other words, if possible, conflicts should be
avoided. We have already seen how replaying the XLOG can be delayed. These are
not the only mechanisms provided by PostgreSQL. There are two more settings we
can use.
The first, and older one of the two, is the setting called vacuum_defer_cleanup_age.
It is measured in transactions, and tells PostgreSQL when to remove a row of data.
Normally, a row can be removed by VACUUM as soon as no transaction can see
it anymore. The vacuum_defer_cleanup_age parameter tells VACUUM to not
clean up a row immediately but wait for some more transactions before it can go
away. Deferring cleanups will keep a row around a little longer than needed. This
helps the slave to complete queries that are relying on old rows. Especially if your
slave is the one handling some analytical work, this will help a lot in making sure
that no queries have to die in vain.
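The corresponding postgresql.conf entry on the master might look as follows; the value shown here is purely an assumption and must be tuned to your workload:

```
vacuum_defer_cleanup_age = 10000   # keep dead rows for 10,000 extra transactions
```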
How can we figure out that the timeline has advanced? Let's take a look at the XLOG
directory of a system that was just turned into a master:
00000002.history
000000020000000000000006
000000020000000000000007
000000020000000000000008
The first part of the XLOG filename is the interesting bit. Until now, it always
began with 1. This is no longer the case: the leading number has changed (after
turning the slave into a master, we have reached timeline number 2).
It is important to mention that (as of PostgreSQL 9.4) you cannot simply pump the
XLOG of timeline 5 into a database instance that is already at timeline 9. It is simply
not possible.
Since PostgreSQL 9.3, timelines are handled a little more flexibly: timeline
changes are written to the transaction log, so a slave can follow a timeline
shift easily.
Delayed replicas
So far, this book has discussed two main replication scenarios. Both are highly
useful and can serve people's needs nicely. However, what
happens if databases, and especially change volumes, start being really large? What
if a 20 TB database has produced 10 TB of changes and something drastic happens?
Somebody might have accidentally dropped a table or deleted a couple of million
rows, or maybe somebody set data to a wrong value. Taking a base backup and
performing a recovery might be way too time consuming, because the amount of
data is just too large to be handled nicely.
The same applies to performing frequent base backups. Creating a 20 TB base backup
is just too large, and storing all those backups might be pretty space consuming.
Of course, there is always the possibility of getting around certain problems on the
filesystem level. However, it might be fairly complex to avoid all of those pitfalls in
a critical setup.
Handling crashes
It is generally wise to use a transaction log archive when using a lagging slave. In
addition to that, a crash of the master itself has to be handled wisely. If the master
crashes, the administrator should make sure that they decide on a point in time to
recover to. Once this point has been found (which is usually the hardest part of the
exercise), recovery_target_time can be added to recovery.conf. Once the slave
has been restarted, the system will recover to this desired point in time and go live. If
the time frames have been chosen wisely, this is the fastest way to recover a system.
Summary
In this chapter, you learned about streaming replication. We saw how a streaming
connection can be created, and what you can do to configure streaming to your
needs. We also briefly discussed how things work behind the scenes.
It is important to keep in mind that replication can indeed cause conflicts, which
need proper treatment.
In the next chapter, it will be time to focus our attention on synchronous replication,
which is logically the next step. You will learn to replicate data without potential
data loss.
Setting Up Synchronous Replication
So far, we have dealt with file-based replication (or log shipping, as it is often called)
and a simple streaming-based asynchronous setup. In both cases, data is submitted
and received by the slave (or slaves) after the transaction has been committed on the
master. During the time between the master's commit and the point when the slave
actually has fully received the data, it can still be lost.
When setting up synchronous replication, try to keep the following things in mind:
In many cases, it is perfectly fine to lose a couple of rows in the event of a crash.
Synchronous replication can safely be skipped in this case. However, if there is
zero tolerance, synchronous replication is a tool that should be used.
Chapter 5
application_name
------------------
psql
(1 row)
Well, the story goes like this: if this application_name value happens to be part
of synchronous_standby_names, the slave will be a synchronous one. In addition
to that, to be a synchronous standby, it has to be:
• connected
• streaming data in real-time (that is, not fetching old WAL records)
With all of this information in mind, we can move forward and configure our first
synchronous replication.
A couple of changes have to be made to the master. The following settings will be
needed in postgresql.conf on the master:
wal_level = hot_standby
max_wal_senders = 5 # or any number
synchronous_standby_names = 'book_sample'
hot_standby = on
# on the slave to make it readable
In the next step, we can perform a base backup just as we have done before. We
have to call pg_basebackup on the slave. Ideally, we already include the transaction
log when doing the base backup. The --xlog-method=stream parameter allows us
to fire things up quickly and without any greater risks.
Once the base backup has been performed, we can move ahead and write a simple
recovery.conf file suitable for synchronous replication, as follows:
standby_mode = on
The config file looks just like before. The only difference is that we have added
application_name to the scenery. Note that the application_name parameter
must be identical to the synchronous_standby_names setting on the master.
In our example, the slave is on the same server as the master. In this case, you
have to ensure that those two instances will use different TCP ports, otherwise the
instance that starts second will not be able to fire up. The port can easily be changed
in postgresql.conf.
After these steps, the database instance can be started. The slave will check out
its connection information and connect to the master. Once it has replayed all the
relevant transaction logs, it will be in synchronous state. The master and the slave
will hold exactly the same data from then on.
To check for replication, we can connect to the master and take a look at pg_stat_
replication. For this check, we can connect to any database inside our (master)
instance, as follows:
postgres=# \x
Expanded display is on.
postgres=# SELECT * FROM pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid | 62871
usesysid | 10
usename | hs
application_name | book_sample
client_addr | ::1
client_hostname |
client_port | 59235
backend_start | 2013-03-29 14:53:52.352741+01
state | streaming
sent_location | 0/30001E8
write_location | 0/30001E8
flush_location | 0/30001E8
replay_location | 0/30001E8
sync_priority | 1
sync_state | sync
This system view will show exactly one line per slave attached to your master system.
The \x command will make the output more readable for you. If you
don't use \x to transpose the output, the lines will be so long that it
will be pretty hard for you to comprehend the content of this table. In
expanded display mode, each column will be in one line instead.
You can see that the application_name parameter has been taken from the connect
string passed to the master by the slave (which is book_sample in our example). As
the application_name parameter matches the master's synchronous_standby_names
setting, we have convinced the system to replicate synchronously. No transaction
can be lost anymore because every transaction will end up on two servers instantly.
The sync_state setting will tell you precisely how data is moving from the master
to the slave.
Consider what you have learned about the CAP theorem earlier in this
book. Synchronous replication is exactly the situation in which physical
limitations (such as network latency) have a serious impact on performance.
The main question you really have to ask yourself is: do I really want to replicate
all transactions synchronously? In many cases, you don't. To prove our point, let's
imagine a typical scenario: a bank wants to store accounting-related data as well as
some logging data. We definitely don't want to lose a couple of million dollars just
because a database node goes down. This kind of data might be worth the effort of
replicating synchronously. The logging data is quite different, however. It might be
far too expensive to cope with the overhead of synchronous replication. So, we want
to replicate this data in an asynchronous way to ensure maximum throughput.
Setting synchronous_commit to on
In the default PostgreSQL configuration, synchronous_commit has been set to on.
In this case, commits will wait until a reply from the current synchronous standby
indicates that it has received the commit record of the transaction and has flushed it
to the disk. In other words, both servers must report that the data has been written
safely. Unless both servers crash at the same time, your data will survive potential
problems (crashing of both servers should be pretty unlikely).
Keep in mind that this can have a serious impact on your application. Imagine a
transaction committing on the master and you wanting to query that data instantly
on one of the slaves. There would still be a tiny window during which you can
actually get outdated data.
Setting synchronous_commit to local can also cause a small time delay window,
during which the slave can actually return slightly outdated data. This phenomenon
has to be kept in mind when you decide to offload reads to the slave.
test=# COMMIT;
COMMIT
In this example, we changed the durability requirements on the fly. This will make
sure that this very specific transaction will not wait for the slave to flush to the disk.
Note that sync_state has not changed. Don't be fooled by what you see
here; you can completely rely on the behavior outlined in this section. PostgreSQL is
perfectly able to handle each transaction separately. This is a unique feature of this
wonderful open source database; it puts you in control and lets you decide which
kind of durability requirements you want.
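A sketch of what such a per-transaction switch looks like; the table name t_log is hypothetical, and SET LOCAL confines the change to the current transaction:

```sql
BEGIN;
-- Only this transaction skips waiting for the synchronous standby:
SET LOCAL synchronous_commit TO local;
INSERT INTO t_log (message) VALUES ('logging data, no sync wait');
COMMIT;
```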
Let's assume a simple test: in the following scenario, we have connected two equally
powerful machines (3 GHz, 8 GB RAM) over a 1 Gbit network. The two machines are
next to each other. To demonstrate the impact of synchronous replication, we have
left shared_buffers and all other memory parameters as default, and only changed
fsync to off to make sure that the effect of disk wait is reduced to practically zero.
The test is simple: we use a one-column table with only one integer field and 10,000
single transactions consisting of just one INSERT statement:
INSERT INTO t_test VALUES (1);
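A sketch of how such a workload can be generated; the file is then fed to psql (without -1, every line commits as its own transaction):

```shell
# Generate 10,000 single-statement transactions.
N=10000
for i in $(seq 1 "$N"); do
    echo "INSERT INTO t_test VALUES (1);"
done > /tmp/inserts.sql

wc -l < /tmp/inserts.sql
```

Timing a run such as `time psql test < /tmp/inserts.sql` under both synchronous_commit settings reproduces this comparison.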
As you can see, the test has taken around 6 seconds to complete. This test can
be repeated with synchronous_commit = local now (which effectively means
asynchronous replication):
real 0m0.909s
user 0m0.101s
sys 0m0.142s
In this simple test, you can see that the speed has gone up by as much as six times.
Of course, this is a brute-force example, which does not fully reflect reality (this was
not the goal anyway). What is important to understand, however, is that synchronous
versus asynchronous replication is not a matter of a couple of percentage points or
so. This should stress our point even more: replicate synchronously only if it is really
needed, and if you really have to use synchronous replication, make sure that you
limit the number of synchronous transactions to an absolute minimum.
Also, please make sure that your network is up to the job. Replicating data
synchronously over network connections with high latency will kill your system
performance like nothing else. Keep in mind that throwing expensive hardware at
the problem will not solve the problem. Doubling the clock speed of your servers
will do practically nothing for you because the real limitation will always come
from network latency.
When synchronous replication is used, how can you still make sure that performance
does not suffer too much? Basically, there are a couple of important suggestions that
have proven to be helpful:
• Use longer transactions: Remember that the system must ensure on commit
that the data is available on two servers. We don't care what happens in
the middle of a transaction, because anybody outside our transaction cannot
see the data anyway. A longer transaction will dramatically reduce
network communication.
• Run stuff concurrently: If you have more than one transaction going on at
the same time, it will be beneficial to performance. The reason for this is that
the remote server will return the position inside the XLOG that is considered
to be processed safely (flushed or accepted). This method ensures that many
transactions can be confirmed at the same time.
At first glance, this looks like nonsense, but if you think about it more deeply, you
will figure out that synchronous replication is actually the only correct thing to do.
If somebody decides to go for synchronous replication, the data in the system must
be worth something, so it must not be at risk. It is better to refuse data and cry out to
the end user than to risk data and silently ignore the requirements of high durability.
If you decide to use synchronous replication, you must consider using at least three
nodes in your cluster. Otherwise, it will be very risky, and you cannot afford to lose
a single node without facing significant downtime or risking data loss.
Summary
In this chapter, we outlined the basic concept of synchronous replication, and
showed how data can be replicated synchronously. We also showed how durability
requirements can be changed on the fly by modifying PostgreSQL runtime parameters.
PostgreSQL gives users the choice of how a transaction should be replicated, and
which level of durability is necessary for a certain transaction.
In the next chapter, we will dive into monitoring and see how we can figure out
whether our replicated setup is working as expected.
Monitoring Your Setup
In previous chapters of this book, you learned about various kinds of replication
and how to configure various types of scenarios. Now it's time to make your setup
more reliable by adding monitoring. Monitoring is a key component of every critical
system, and therefore a deep and thorough understanding of it is essential in order
to keep your database system up and running.
In this chapter, you will learn what to monitor and how to implement reasonable
monitoring policies. You will learn how to do the following:
At the end of this chapter, you should be able to monitor any kind of replication
setup properly.
Of course, there are countless other things that can go wrong. However, in this
chapter, our goal is to focus on the most common issues people face.
Checking archive_command
A failing archive_command might be one of the greatest showstoppers in
your setup. The purpose of archive_command is to push XLOG to some archive and
store the data there. But what happens if those XLOG files cannot be pushed for
some reason?
The answer is quite simple: the master has to keep these XLOG files to ensure that
no XLOG files can be lost. There must always be an uninterrupted sequence of
XLOG files. Even if a single file in the sequence of files is missing, your slave won't
be able to recover anymore. For example, if your network has failed, the master will
accumulate those files and keep them. Logically, this cannot be done forever, and so,
at some point, you will face disk space shortages on your master server.
This can be dangerous, because if you are running out of disk space, there is no way
to keep writing to the database. While reads might still be possible, most writes will
definitely fail and cause serious disruptions on your system. PostgreSQL won't fail
and your instance will be intact after a disk has filled up, but—as stated before—your
service will be interrupted.
The core question here is: what would be a reasonable number to check for? In
a standard configuration, PostgreSQL should not use more XLOG files than
checkpoint_segments * 2 + wal_keep_segments. If the number of XLOG files starts
skyrocketing, something is wrong and deserves immediate investigation.
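A quick sketch of the resulting alert threshold; the configuration values are assumptions and must be replaced with your own:

```shell
# Rough upper bound on XLOG files in a standard configuration.
CHECKPOINT_SEGMENTS=32
WAL_KEEP_SEGMENTS=1000
LIMIT=$((CHECKPOINT_SEGMENTS * 2 + WAL_KEEP_SEGMENTS))
echo "alert if pg_xlog holds many more than $LIMIT files"
# In practice, compare against: ls "$PGDATA/pg_xlog" | wc -l
```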
If you perform these checks properly, nothing bad can happen on this front.
But if you fail to check these parameters, you are risking doomsday!
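A minimal monitoring sketch along these lines (the function name, the example path, and the parameter values are assumptions, not from the original text) counts the XLOG segment files in pg_xlog and compares the count against the rule of thumb given above:

```shell
# Count XLOG segment files and compare against
# checkpoint_segments * 2 + wal_keep_segments.
check_xlog_count () {
    dir=$1
    checkpoint_segments=$2
    wal_keep_segments=$3
    limit=$(( checkpoint_segments * 2 + wal_keep_segments ))
    # XLOG segment file names are 24 hexadecimal characters long
    count=$(ls "$dir" 2>/dev/null | grep -c '^[0-9A-F]\{24\}$')
    if [ "$count" -gt "$limit" ]; then
        echo "WARNING: $count XLOG files in $dir (expected at most $limit)"
        return 1
    fi
    echo "OK: $count XLOG files in $dir"
}

# Example invocation: checkpoint_segments = 3, wal_keep_segments = 0
check_xlog_count /var/lib/pgsql/data/pg_xlog 3 0
```

On a real system, pass the actual pg_xlog directory along with the checkpoint_segments and wal_keep_segments values taken from postgresql.conf.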
Chapter 6
Apart from disk space, which has to be monitored anyway, there is one more thing
you should keep on your radar. You have to come up with a decent policy to handle
base backups. Remember that you are allowed to delete XLOG only if it is older than
the oldest base backup that you want to preserve. This tiny thing can undermine your
disk space monitoring. It is highly recommended to make sure that your archive has
enough spare capacity. This is important if your database system has to write many
transaction logs.
Checking pg_stat_replication
Checking the archive and archive_command is primarily for Point-In-Time
Recovery. If you want to monitor a streaming-based setup, it is advised to keep
an eye on a system view called pg_stat_replication. This view contains the
following information:
test=# \d pg_stat_replication
View "pg_catalog.pg_stat_replication"
Column | Type | Modifiers
------------------+--------------------------+-----------
pid | integer |
usesysid | oid |
usename | name |
application_name | text |
client_addr | inet |
client_hostname | text |
client_port | integer |
backend_start | timestamp with time zone |
backend_xmin | xid |
state | text |
sent_location | pg_lsn |
write_location | pg_lsn |
flush_location | pg_lsn |
replay_location | pg_lsn |
sync_priority | integer |
sync_state | text |
For each slave connected to our system via streaming, the view listed here will
return exactly one line of data. You will see precisely what your slaves are doing.
• flush_location: This is the last location that was flushed to the standby
system. Mind the difference between writing and flushing here. Writing
does not imply flushing (refer to the A simple INSERT statement section in
Chapter 2, Understanding the PostgreSQL Transaction Log, for the durability
requirements).
• replay_location: This is the last transaction log position that is replayed
on the slave.
• sync_priority: This field is relevant only if you are replicating
synchronously. Each synchronous replica is assigned a priority, and this
field tells you which priority has been chosen.
• sync_state: Finally, you can see which state the slave is in. The state can
be async, sync, or potential. PostgreSQL will mark a slave as potential
when there is a sync slave with a higher priority.
Remember that each record in this system view represents exactly one slave.
So, you can see at first glance who is connected and what task is being done.
pg_stat_replication is also a good way to check whether a slave is connected
in the first place.
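For example, the view can be used to see at a glance how far behind each slave is. The following query is a sketch for PostgreSQL 9.4 (pg_xlog_location_diff reports the difference in bytes):

```sql
SELECT application_name, client_addr, state, sync_state,
       pg_xlog_location_diff(pg_current_xlog_location(),
                             replay_location) AS replay_lag_bytes
FROM pg_stat_replication;
```

A monitoring system can alert whenever replay_lag_bytes exceeds a threshold that makes sense for your workload.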
On the master, we can simply check for a process called wal sender. On the slave,
we have to check for a process called wal receiver.
Let's check out what we are supposed to see on the master first. The following
command does the job:
ps ax | grep "wal sender"
On Linux, we can see that the process carries not only its purpose (in this case,
wal sender) but also the name of the end user and network-related information.
In our case, we can see that somebody has connected from localhost through
port 61498.
If both the processes on the master and the slave are present, you have already
got a pretty good indicator that your replication setup is working nicely.
To read data from the view, you can simply use TABLE pg_replication_slots.
The main problem is that an administrator has to know about those replication
slots that are still needed and in use. The authors are not aware of any automated,
feasible methods to clean up replication slots that are not in use anymore. At this
point, it really boils down to knowledge about the system.
What you should do is periodically check the difference between the current
transaction ID and the transaction ID listed for the replication slot. In addition
to that, the LSNs can be checked. Unusually high differences can be reported by
the monitoring system.
Actually, it is not so easy to determine what the usual range is, as it might vary.
In most cases, the difference will be very close to zero. However, if a large
index is built and the network is slow, the difference might peak temporarily,
making it hard to determine a good value for the threshold.
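One way to implement the check described above is a query like this (a sketch for PostgreSQL 9.4; the exact query from the original is not preserved in this copy):

```sql
SELECT slot_name,
       age(xmin) AS xid_age,
       pg_xlog_location_diff(pg_current_xlog_location(),
                             restart_lsn) AS restart_lag
FROM pg_replication_slots;
```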
The result is in bytes. Note that if the replication lag (or difference) constantly
grows, it might indicate that the replication has stopped.
If the age of a certain transaction ID is needed, the age function can be used:
test=# SELECT age('223'::xid);
age
-----
722
(1 row)
If a replication slot still provides a very old transaction log, it might not be in
use anymore. Again, dropping a replication slot should be done only after
thorough checking.
One of the most popular monitoring tools around is Nagios. It is widely used and
supports a variety of software components.
Installing check_postgres
Once you have downloaded the plugin from the Bucardo website, it is easy to
install it, as follows:
2. Now you can enter the newly created directory and run the Perl Makefile:
perl Makefile.PL
3. Finally, build and install the module:
make
make install
The last step must be performed as the root user. Otherwise, you
will most likely not have enough permissions to deploy the code
on your system.
Starting check_postgres.pl directly is also a way to call those plugins from the
command line and to check whether the results make sense.
It really depends on the type of application you are running, so you have to think
for yourself and come up with a reasonable set of checks and thresholds. Logically,
the same applies to any other monitoring software you can think of. The rules
are always the same: think about what your application does, and consider the
things that can go wrong. Based on this information, you can then select the
proper checks. A list of all available checks can be found at
http://bucardo.org/check_postgres/check_postgres.pl.html.
It is important to mention that certain checks that are not related to PostgreSQL
are always useful. The following is a (by no means complete) list of suggestions:
• CPU usage
• Disk space consumption and free space
• Memory consumption
• Disk wait
• Operating system swap usage
• General availability of the desired services
Summary
In this chapter, you learned a lot about monitoring. We saw what to check for in
the archive and how to interpret the PostgreSQL internal system views. Finally,
we saw which processes to check for at the operating system level.
All of these checks will, together, provide you with a pretty nice safety net for
your database setup.
The next chapter is dedicated exclusively to High Availability. You will be taught
the important concepts related to High Availability, and we will guide you through
the fundamentals.
Understanding Linux High
Availability
High Availability (HA) is not just an industry term. The fact that something needs to
be available usually also means that it is super important, and that the service must
be available to clients, no matter the cost.
"Anything" really includes everything in life. This is well understood by all service
providers who intend to retain their customers. Customers usually aren't satisfied if
the service they want is not continuous, or not available. Availability is also called
uptime, and its opposite is called downtime.
Depending on the service, downtime can be more or less tolerated. For example,
if a house is heated using wood or coal, the homeowner can stack up a lot of it
before winter to avoid depending on the availability of shipping during the winter.
However, if the house is heated using natural gas, availability is a lot more important.
Uninterrupted service (there should be enough pressure in the gas pipe coming into
the house) and a certain heating quality of the gas are expected from the provider.
Also, we will talk about the perception of High Availability when the downtime
is hidden.
Measuring availability
The idea behind availability is that the service provider tries to guarantee a certain
level of it, and clients can then expect that or more. In some cases (depending on
service contracts), a penalty fee or a decreased subscription fee is the consequence
of an unexpected downtime.
However, let's consider the 0.001 percent downtime for the "five nines" case.
Users experience denied or delayed service only 5 minutes and 15 seconds in total
(that is, 864 milliseconds every day) during the entire year, which may not be
noticed at all. Because of this, the service is perceived to be uninterrupted.
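The arithmetic behind these numbers is easy to verify. The following sketch (not from the original text) converts an availability percentage into the downtime it permits per year:

```shell
# Seconds of downtime per year allowed by a given availability percentage.
downtime_seconds_per_year () {
    awk -v a="$1" 'BEGIN { printf "%.0f", (100 - a) / 100 * 365 * 24 * 3600 }'
}

# "Five nines" leaves roughly 315 seconds, i.e. 5 minutes and 15 seconds, per year
downtime_seconds_per_year 99.999
echo
```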
The second and third examples show that no matter what the provider does,
there is a minimum downtime, and the uptime converges to the maximum that
can be provided.
Chapter 7
Let's see what planned downtime means and what can be done to hide it. Let's take
a theoretical factory and its workers. The workers operate on certain machinery and
they expect it to work during their work hours. The factory can have different shifts,
so the machinery may not be turned off at all, except for one week of maintenance.
The workers are told to take their vacation during this time period. If there is really
no other downtime, everyone is happy. On the other hand, if there is downtime, it
means lost income for the factory and wasted time and lower wages for the workers.
Let's look at the "one hour every day" downtime. It amounts to more than two weeks
in total, which is kind of surprising. It's actually quite a lot if added together. But
in some cases, the service is really not needed for that single hour during the whole
day. For example, a back office database can have automatic maintenance scheduled
for the night, when there are no users in the office. In this way, there is
no perceived downtime; the system is always up when the users need it.
Another example of this "one hour downtime every day" is a non-stop hypermarket.
Cash registers usually have to be switched to daily report mode before the first
payment on the next day; otherwise, they refuse to accept further payments.
These reports must be printed for accounting and the tax office. Being a non-stop
hypermarket, it doesn't actually close its doors, but the customers cannot pay and
leave until the cash registers are switched back to service mode.
For a simple example, we can consider a cluster consisting of two database servers.
One is running as the primary server, accepting writes from our application. The
other one is running as a standby, replicating the changes from the primary and
ready to take over if something bad, such as a crash, happens. The question we
have to answer now is how many
changes we can afford to lose and how often. In some systems, such as those
collecting sensor statistics, it's not a huge tragedy, or even noticeable, if a couple of
minutes' worth of work goes missing occasionally. In other cases, such as financial
systems, we will be in deep trouble if even a single transaction goes missing.
If losing some data is okay, then the primary system can report a success as soon as
it has successfully processed the data. This is called asynchronous replication. If the
primary fails in this system, there is a pretty good chance that a few transactions will
not be replicated and will therefore be lost.
On the other hand, if we need to ensure that no data is lost under any circumstances,
we would want to wait for a confirmation that the standby has successfully written
the data to the disk. What happens if the standby fails? If losing data in any single
failure is not acceptable, then our system must stop accepting write transactions once
either of the nodes fails. The system will remain unavailable until a new standby is
set up, which can take a considerable amount of time. If both High Availability and
high durability are required, then at least three database nodes are a must.
Detecting failures
For a system to be highly available, failures need to be detected and corrective action
needs to be taken. For systems requiring lesser availability, it is acceptable to page
the database administrator when anomalies occur and have them wake up and
figure out what happened. To provide High Availability, you will need a rotation
of database administrators sitting on guard in case something bad happens, in order
to have even a remote chance of reacting fast enough. Obviously, automation is
required here.
At first glance, it might seem that automating failovers is easy enough—just have
the standby server run a script that checks whether the primary is still there, and
if not, promote itself to primary. This will actually work okay in most common
scenarios. However, to get High Availability, thinking about common scenarios
is not enough. You have to make the system behave correctly in the most unusual
situations you can think of and, even more importantly, in situations that you
can't think of, because Murphy's law ensures that all of those situations will
eventually happen to you.
What happens if the primary is only temporarily blocked and starts receiving
connections again? Will you have two primary servers then? What happens if, due
to a network failure or misconfiguration, the two cluster nodes can't communicate
with each other, but can communicate with the application? Will you be able to avoid
repeated failovers? Is your system correctly handling timeouts? This is where using
a tool that implements some of these concerns becomes handy. Pacemaker and the
Linux-HA stack are two such tools.
The simplest and most common way to provide quorum is to say that whenever a
group of nodes consists of a majority (more than 50 percent), this group has quorum.
This system isn't very useful in a two-node cluster, as the smallest number that can
have majority is 2, which means that quorum is lost once either of the nodes goes
down. This is great for guaranteeing durability, but not so good for availability.
There are some ways to securely run a two-node cluster, but the simplest solution
is to add a third node that has the sole duty of being a tiebreaker in a network split.
Just remember that if the tiebreaker node shares a single point of failure with either
of the two database nodes, that point becomes a point of failure for the entire
cluster. So, don't put it into a small VM that shares a physical server with one of
the database nodes.
Quorum means that the winning side knows that it has the right to run the primary,
but this is only half of the story. The cluster also needs a way to make sure that the
servers that become unresponsive are truly down, not just temporarily incapacitated.
This is called fencing the node, or sometimes, a more gruesome acronym: shoot the
other node in the head (STONITH). There are many fencing mechanisms, but one
of the most effective mechanisms is the use of an integrated management computer
in the target server, or a remotely controllable power delivery unit to just turn the
power off and on again. This ensures that the cluster can quickly determine with
certainty that the old primary is no longer there, and promote a new one.
Understanding Linux-HA
The Linux-HA stack is built from a set of modular components that provide
services. The major parts of this stack are the messaging layer (Corosync), the
cluster manager (Pacemaker), application adaptor scripts (resource agents), fencing
adaptors for STONITH (fence agents), and command-line tools for configuring all
of this (pcs). The details of how this stack is put together have varied over time
and across distributions, so if you are using an older distribution, you might see
some differences compared to what is described here. For newer distributions,
the picture seems to be stabilizing, and you shouldn't encounter major differences.
Corosync
The Corosync messaging layer is responsible for knowing which nodes are up and
part of the cluster, handling reliable communication between nodes, and storing
the cluster state in a reliable and consistent in-memory database.
Corosync uses a shared key for communications, which you need to generate when
setting up your cluster using the corosync-keygen tool. The cluster members can be
configured to communicate in different ways. The most common way requires that
you pick a multicast address and port for multicast communications, and make sure
that the multicast UDP traffic actually works on your network. By using this method,
you don't need to configure a list of nodes in your configuration. Alternatively, you
can set up the system to use a unicast communication method called udpu. When you
use this method, you need to list all the cluster nodes in your corosync.conf file.
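As a sketch (this fragment is an assumption, not from the original text), a udpu configuration in corosync.conf might look like the following, using the node names from the example later in this chapter:

```
totem {
    version: 2
    cluster_name: pgcluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: db1
        nodeid: 1
    }
    node {
        ring0_addr: db2
        nodeid: 2
    }
    node {
        ring0_addr: quorum1
        nodeid: 3
    }
}
```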
When using RHEL6 and derivative distributions, you will need to set up CMAN
using cluster.conf, and CMAN will take care of starting Corosync for you.
Pacemaker
Pacemaker is the brain of the system. It is the component that decides what to
fail over and where. Pacemaker uses Corosync to elect one system as the
designated controller node. This node is then responsible for instructing all
the other nodes on what to do.
The most important part to understand about Pacemaker is that you don't have to tell
Pacemaker what to do. Instead, you have to describe what you would like the cluster
to look like, and Pacemaker does its best to deliver within the given constraints. You
do this by describing the resources you want to be running, and assigning scores that
tell Pacemaker how the resources need to be placed and in what order they need to
start and stop. Pacemaker uses this information to figure out what the desired state is
and what actions need to be taken to get there. If something changes (for example, a
node fails), the plan is recalculated and the corresponding corrective action is taken.
For things where you want the cluster to pass configuration parameters, or things
that need more complex cluster awareness, Open Cluster Framework (OCF) resource
agents are used. Resource agents can, in principle, be written in any language, but
are mostly written as shell scripts. There exists an official resource-agents package
that contains most of the common scripts you would want to use, including a pgsql
resource agent for managing PostgreSQL database clusters.
If a suitable resource agent is not available, don't be afraid to write your own.
The Linux-HA website has a comprehensive development guide for resource
agents, and you can pick any of the existing ones as an example to get you started.
Fence agents are similar to resource agents, but are only used for adapting different
fencing mechanisms. They too are freestanding executables that follow a specific
calling convention. The methods used include turning off the power through a
networked power switch, issuing a hard reset through a server management interface,
resetting a VM through the hypervisor, and calling a cloud API requesting a reset or
waiting for an operator to manually toggle a switch and tell the cluster that fencing has
been done. The scripts can also be stacked, so the less disruptive methods can be tried
first, and the bigger guns should come out only when trying to be nice fails.
PCS
Internally, Pacemaker's configuration database is stored as an XML document.
While it is possible to change the configuration by directly modifying the XML,
most people prefer to retain their sanity and use a tool that presents a friendlier
interface. The latest and greatest such tool is called PCS. On older systems, you
might have encountered a tool called crmsh. The core concepts are the same, so
don't worry too much if you are not able to use PCS.
When picking the node to promote to master, the resource agent checks the WAL
positions of all the nodes and picks the one that is furthest ahead. This avoids
losing transactions when both nodes go down and come up again at the same time.
This algorithm also eliminates the need to reimage a node because of transactions
that are not on the main timeline. However, at the same time, it does mean that
sometimes an undesirable node can be chosen to become the master. In this case,
the preferred procedure is to force a failover manually (see the later section on
enforcing a failover).
This is important: take care to remove this lock file after your
maintenance operations are completed, in order to allow a
slave server to start up.
If both the slaves have the DISCONNECTED status, it means that something is seriously
broken, and it is recommended to manually check the control file's XLOG position
and the timeline history files on both hosts to find out exactly what is going on.
Then, update the cluster state manually (using the crm_attribute utility), like
this, for example:
crm_attribute -N <node name> -n "pgsql-data-status" -v "LATEST"
For the network setup, it is really important that the hostnames are set properly before
you start setting up High Availability. Pacemaker uses hostnames to identify systems,
and that makes changing a node's hostname a rather painful procedure. Make sure
that the hostname lookup works properly on all nodes, either by using a DNS server
or by adding entries to /etc/hosts on all the nodes. Set reasonably short hostnames
and avoid including too much metadata in the name, as that makes life easier when
operating the cluster. A name such as prod-db123 is nice and easy to type and
doesn't need to be changed when things get shuffled around;
database1-vhost418-rack7.datacenter-europe.production.mycompany.org is not so
easy. Don't name one database node primary and the other standby, as after a
failover, that will no longer be true and will create unnecessary confusion in
an already tricky situation.
For our example, we will call the nodes db1, db2, and quorum1. The corresponding
IPs of the hosts are 10.1.1.11, 10.1.1.12, and 10.1.1.13.
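If you are not running a DNS server, the corresponding /etc/hosts entries on every node would look like this:

```
10.1.1.11    db1
10.1.1.12    db2
10.1.1.13    quorum1
```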
You will also need to pick a cluster IP address. You don't need to do anything special
with the address; just ensure that nothing else is using it and that the address you
pick is in the same network as the database nodes. This cluster IP will be managed
by Pacemaker and added as a secondary IP to the host with the primary. For our
example, we will pick 10.1.1.100 as the IP address.
You can see the list of IP addresses assigned to an interface by using the
ip addr command.
You will, of course, also need to install PostgreSQL. Make sure that PostgreSQL
is not automatically started by the operating system. This interferes with normal
cluster operations and can cause split-brain scenarios.
Once you have the packages installed, you need to start the pcsd service, enable
it to be started at boot-up time, and set a password for the hacluster user added
by the pcs package. Remember to do this on all nodes.
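On a systemd-based distribution (an assumption; the exact commands depend on your platform), those steps look like this:

```shell
systemctl start pcsd.service
systemctl enable pcsd.service
passwd hacluster
```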
It's possible to run the cluster without running the pcsd service. Then you will
have to set up Corosync keys and configurations by yourself.
Then we set up the cluster using the default settings. You can also adjust the cluster
communication method here, or configure multiple redundant network interfaces for
communication. Refer to the documentation for more details:
[root@db1 ~]# pcs cluster setup --name pgcluster db1 db2 quorum1
Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop pacemaker.service
Redirecting to /bin/systemctl stop corosync.service
Killing any remaining services...
Removing all cluster configuration files...
db1: Succeeded
db2: Succeeded
quorum1: Succeeded
Finally, we can start the clustering software and verify that it has started up:
[root@db1 ~]# pcs cluster start --all
db1: Starting Cluster...
db2: Starting Cluster...
quorum1: Starting Cluster...
[root@db1 ~]# pcs status corosync
Membership information
--------------------------
Nodeid   Votes   Name
1        1       db1 (local)
2        1       db2
3        1       quorum1
In our example, the database will be set up in the /pgsql/data directory, and
will run under the operating system user postgres, so we change the filesystem
privileges correspondingly:
[root@db1 ~]# mkdir /pgsql
[root@db1 ~]# mkdir /pgsql/data
[root@db1 ~]# chown postgres /pgsql/data
For the pgsql resource agent, you also need to create a directory for some
runtime files:
[root@db1 ~]# mkdir /pgsql/run
[root@db1 ~]# chown postgres /pgsql/run
There are a couple of things to keep in mind when configuring your database.
In addition to the changes that every new installation needs, in postgresql.conf,
you would want to set the following replication-specific settings:
wal_level = hot_standby
We need at least 1 WAL sender to support the replication, but as they don't really
cost much, we might as well add a few spare ones for future expansion:
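The corresponding postgresql.conf line is missing from this copy of the text; the parameter in question is max_wal_senders, and a value like the following (an assumed value, not necessarily the book's) provides that headroom:

```
max_wal_senders = 5
```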
wal_keep_segments = 1000
Setting this parameter correctly is tricky. If the standby goes offline for a while,
during maintenance for example, we need the primary to keep enough transaction
logs around so that when the standby comes back, it can continue where it left off.
If the necessary transaction logs are no longer available, the slave needs to be
reimaged from the master.
This parameter specifies how much data the primary will keep around, the unit
being a 16 MB chunk of log. The proper value depends on how long maintenance
intervals the system should tolerate, how much transaction log is generated, and
how much disk space is available.
Alternatively, with PostgreSQL 9.4 and later, and with a new enough version of the
pgsql resource agent, it is possible to use replication slots to make the primary keep
enough transaction logs around:
max_replication_slots = 8
We only need one slot but, similarly to max_wal_senders, increasing this value
doesn't really cost anything significant, so we might as well add some headroom.
We also need the following setting:
hot_standby = on
Setting the hot_standby parameter to on is required so that at the very least, the
cluster-monitoring command can verify that the standby is functioning correctly.
For replication, the standby needs to be able to connect to the primary, so you need
to have a database user for replication and add an entry to pg_hba.conf that allows
that user to connect for replication. The automatically created postgres superuser
account will work for this, but is undesirable from a security standpoint.
For our example, we will create a user called standby_repl and set a password.
The following pg_hba.conf entry will allow that user to connect from our
sample network:
# TYPE DATABASE USER ADDRESS METHOD
host replication standby_repl samenet md5
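The user itself can be created with a statement like this (a sketch; the REPLICATION attribute is required for streaming, and the password matches the one used later in primary_conninfo_opt):

```sql
CREATE USER standby_repl WITH REPLICATION PASSWORD 'example';
```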
We also need the pgsql resource-agent to be able to connect locally to the database.
By default, it uses the postgres user to connect locally. This is fine for us, so we just
enable it to connect without a password:
# TYPE DATABASE USER ADDRESS METHOD
local postgres postgres trust
As a first step, we will turn off fencing until we get that set up correctly:
pcs property set stonith-enabled=false
By default, Pacemaker allows resources that you create to run anywhere in the cluster,
unless you explicitly forbid it. For clusters that have databases with local storage, this
default doesn't really make sense. Every time we add a node, we would need to tell
Pacemaker that the database can't run there. Instead, we want to explicitly allow
the resources to run on specific nodes. This is achieved by turning off the
symmetric-cluster property:
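The property is turned off in the same way as stonith-enabled was earlier (the exact line is not preserved in this copy, but it presumably looked like this):

```shell
pcs property set symmetric-cluster=false
```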
This can be a bit too much to take in, so let's split it up and go through it in
smaller pieces:
pcs resource create pgsql ocf:heartbeat:pgsql \
This line means that we are creating a new resource. The name of that resource
will be pgsql, and it will use the pgsql resource-agent from the OCF resource
agents, under the heartbeat directory. The following lines show the configuration
parameters for the resource agent:
pgctl="/usr/pgsql-9.4/bin/pg_ctl" \
psql="/usr/pgsql-9.4/bin/psql" \
Using these options, we tell pgsql that we want to use synchronous replication.
We also need to tell pgsql separately which nodes are expected to be running
PostgreSQL:
master_ip="10.1.1.100" \
repuser="standby_repl" \
primary_conninfo_opt="password=example" \
This block of options specifies how to connect to the primary. The IP address we
use here needs to be the clustered IP address:
stop_escalate="110" \
When PostgreSQL doesn't want to stop quickly enough, the resource agent will try to
just crash it. That sounds bad, but given that otherwise we would have to reboot the
whole operating system, it is a lesser evil. To be effective, this timeout needs to
be smaller than the "stop operation" timeout.
This completes the parameters of the resource agent. You can specify many other
parameters. To see what parameters are available, run pcs resource describe
ocf:heartbeat:pgsql.
The next section specifies the parameters of different operations that the resource
agent can perform:
op start timeout=120s on-fail=restart \
The start operation is performed when PostgreSQL is not running but should be.
The timeout parameter specifies how long we will wait for PostgreSQL to come up,
and on-fail specifies that when starting PostgreSQL fails, Pacemaker should try to
stop it and then try starting again:
op stop timeout=120s on-fail=fence \
The stop operation is performed when Pacemaker needs to stop PostgreSQL, usually
when bringing a node down for maintenance. The on-fail=fence parameter here
means that if stopping PostgreSQL fails or times out, Pacemaker will reboot the
server to expedite the process:
op monitor interval=3s timeout=10s on-fail=restart \
op monitor interval=2s role=Master timeout=10s on-fail=fence \
As for the slave, we can afford to try to gracefully restart it, as nobody is really
waiting for it. When the primary stops responding, applications are waiting and the
quickest thing we can do is just pull the plug on the server and immediately promote
the slave. Because we are using synchronous replication, there will be no data lost:
op promote timeout=120s on-fail=block \
op demote timeout=120s on-fail=fence \
The promote and demote functions are called when a standby needs to become a
master, and vice-versa.
Pacemaker always starts multistate resources as slaves before figuring out which one
to promote. When promotion is required, the pgsql resource agent is able to query
the databases to figure out which one has the latest data, and request that the correct
database be promoted. This has a nice side benefit that every time a node becomes
master, there is a WAL timeline switch, and PostgreSQL can use the timeline history
to detect when an operator error would result in data corruption. When a promotion
fails, we request that Pacemaker keep its hands off so that the situation can be
disentangled manually.
For demotion, PostgreSQL has no mechanism other than shutting the server down.
The same considerations apply here as in the stop operation case:
meta migration-threshold=2 \
This line specifies the meta attributes. These are guidelines for Pacemaker on how
to manage this resource. By setting migration-threshold to 2, we tell Pacemaker to
give up on a node if the resource fails there twice. Giving up avoids endless
retries, which can cause nasty problems.
Once the system administrator has eliminated the cause of the failure, the resource
can be re-enabled by clearing the fail count:
--master clone-max=2
Finally, we tell Pacemaker that the resource needs to be run as a master-slave
cluster with two nodes. The default settings will run one instance on each node.
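Pieced together from the fragments above, the whole command looks roughly like the following. This is a reconstruction rather than a verbatim copy of the original: the pgdata, tmpdir, rep_mode, and node_list parameters are assumptions based on the pgsql resource agent's documented options and on the directories created earlier.

```shell
pcs resource create pgsql ocf:heartbeat:pgsql \
    pgctl="/usr/pgsql-9.4/bin/pg_ctl" \
    psql="/usr/pgsql-9.4/bin/psql" \
    pgdata="/pgsql/data" \
    tmpdir="/pgsql/run" \
    rep_mode="sync" \
    node_list="db1 db2" \
    master_ip="10.1.1.100" \
    repuser="standby_repl" \
    primary_conninfo_opt="password=example" \
    stop_escalate="110" \
    op start timeout=120s on-fail=restart \
    op stop timeout=120s on-fail=fence \
    op monitor interval=3s timeout=10s on-fail=restart \
    op monitor interval=2s role=Master timeout=10s on-fail=fence \
    op promote timeout=120s on-fail=block \
    op demote timeout=120s on-fail=fence \
    meta migration-threshold=2 \
    --master clone-max=2
```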
The second resource for Pacemaker to manage is the IP alias for the clustered
address:
pcs resource create master-ip ocf:heartbeat:IPaddr2 \
ip="10.1.1.100" iflabel="master" \
op monitor interval=5s
This resource is significantly simpler. We only need to specify the desired IP address
and a label here. We also increase the monitoring frequency, to be on the safe side.
There are some things to note here. The first is that we have to refer to the PostgreSQL
resource as pgsql-master. This is just an internal convention of how master-slave
resource naming works in Pacemaker. The second is the master keyword, and it
means that this constraint only refers to a pgsql instance running as primary.
The order in which we specify the resources is important here. Pacemaker will first
decide where the target, the pgsql master, should run, and only then will it try to find
a place for master-ip. We omitted the score attribute in this command, which
results in a score of infinity, making it a mandatory rule for Pacemaker. This means
that if this rule can't be satisfied, master-ip will not be allowed to run.
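A hedged reconstruction of such a colocation command looks like this (a sketch; the exact syntax can vary between pcs versions):

```shell
# Keep the cluster IP on whichever node runs the pgsql master.
# Resource names follow the ones used in this section.
pcs constraint colocation add master-ip with master pgsql-master
```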
Understanding Linux High Availability
Next up is the ordering constraint. We want to try to promote a node only once we
have successfully started the cluster IP address there. Also, when we are doing a
graceful failover (bringing primary down for maintenance), we would want to give
PostgreSQL an opportunity to ship out the unsent transaction logs before we move
the cluster IP to the other host. This results in the following ordering constraint:
pcs constraint order start master-ip then promote pgsql-master
Ordering constraints are symmetrical by default, which means that this constraint
also makes Pacemaker demote pgsql-master before stopping master-ip.
Finally, we set up the location constraints to allow the pgsql resource and the
master-ip resource to run on the database nodes:
We set the score to 0 here to allow the location score to be modified by Pacemaker.
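The corresponding commands look roughly like this (a sketch following the constraint syntax used elsewhere in this chapter):

```shell
# Neutral (score 0) location preferences for both database nodes
pcs constraint location pgsql-master prefers db1=0 db2=0
pcs constraint location master-ip prefers db1=0 db2=0
```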
Setting up fencing
The specifics of how to set up fencing depend on what kind of fencing method is
available to you. For demonstration purposes, we are going to set up IPMI-based
fencing, which uses a rather common management interface standard to
toggle power.
We don't use a fence agent for the quorum host, as nothing is allowed to run on it.
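The fence agents themselves can be created along these lines; the IPMI addresses and credentials below are placeholders, and the exact fence_ipmilan parameters depend on your agent version:

```shell
# One fence agent per database node (hypothetical IPMI details)
pcs stonith create db1-fence fence_ipmilan \
    ipaddr="10.1.2.1" login="admin" passwd="secret" pcmk_host_list="db1"
pcs stonith create db2-fence fence_ipmilan \
    ipaddr="10.1.2.2" login="admin" passwd="secret" pcmk_host_list="db2"
```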
We then add location constraints that allow the fence agents to run on hosts other
than the node that this specific resource is tasked with fencing:
pcs constraint location db1-fence prefers db2=0 quorum1=0
pcs constraint location db2-fence prefers db1=0 quorum1=0
First comes the header. It shows general information about the cluster. The most
interesting part here is the line that tells us which node is the current designated
controller (DC). The cluster logs on that node will contain information about what
Pacemaker decided to do and why:
Cluster name: pgcluster
Last updated: Thu Jul 16 01:31:12 2015
Last change: Wed Jul 15 23:23:21 2015
Stack: corosync
Current DC: db1 (1) - partition with quorum
Version: 1.1.12-a9c8177
3 Nodes configured
5 Resources configured
In this example, we see that db1 is running as the master, db2 is running as the slave,
and master-ip has been moved to the correct host.
This is the master promotion score that controls which node gets picked by
Pacemaker to be the master. The values are assigned by the pgsql resource agent,
and their semantics are as follows:
• -INF: This instance has fallen behind and should not be considered a
candidate for the master.
• 100: This instance is eligible for promotion when the master fails.
• 1000: This is the desired master node.
The pgsql-data-status attribute is the most important attribute specifying
whether the node is eligible to be promoted or not. Its values are as follows:
• STREAMING|SYNC: This instance is connected to the master and streams
WAL data synchronously. It is eligible for promotion.
• DISCONNECTED: This instance is disconnected from the master and is not
eligible for promotion when the master fails.
• LATEST: This instance is/was running as the master and is the preferred
candidate for promotion.
According to the attributes for the standby server, pgsql RA thinks that it is connected
to the primary, is replicating synchronously, and is eligible to be promoted to master.
Forcing a failover
To force a failover, you can just bring the master server down for maintenance and
then immediately bring it back up again.
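With pcs, putting a node into standby and back is one way to do this (a sketch, assuming db1 is the current master):

```shell
# Pacemaker promotes the other node while db1 is in standby
pcs cluster standby db1
pcs cluster unstandby db1
```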
Once the maintenance mode is enabled, Pacemaker does not perform any actions
on the cluster, and you are free to perform any reconfigurations. At the same time,
you are also responsible for maintaining High Availability.
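Maintenance mode is enabled with the counterpart of the disable command used at the end of this section:

```shell
pcs property set maintenance-mode=true
```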
If the master role is changed during the maintenance interval or you have started/
stopped PostgreSQL, you need to tell Pacemaker to forget about any state and
manually assign the preferred master to the node where the master PostgreSQL is
currently running. Failing to do so can cause spurious failovers, or even extended
downtime, when the cluster can't correctly detect which node to promote to master.
To reset the cluster state, use the following commands:
# Clean up resource information
pcs resource cleanup pgsql
pcs resource cleanup master-ip
# Assign PostgreSQL master score 1000 to the node that is currently master
crm_attribute -l reboot -N db1 -n master-pgsql -v 1000
# Disable promotion on the slave node
crm_attribute -l reboot -N db2 -n master-pgsql -v -INF
When ready to go, disable the maintenance mode with the following command:
pcs property set maintenance-mode=false
It is advisable that you monitor the system with crm_mon for some time to see
whether anything unexpected happens.
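For example, the following invocation shows the status once, including the node attributes and fail counts discussed earlier (flags can differ slightly between Pacemaker versions):

```shell
# -A: node attributes, -f: fail counts, -1: print once and exit
crm_mon -Af -1
```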
The following commands can be used to reset a failed PostgreSQL master to a slave
with minimal network traffic.
The pg_rewind tool needs the database to be in a cleanly shut down state. A failed
master, however, is probably in a crashed state. We remove recovery.conf, start it as a
master in single-user mode, and quit immediately after the recovery is done:
rm -f /pgsql/data/recovery.conf
sudo -u postgres /usr/pgsql-9.4/bin/postgres --single \
  -D /pgsql/data/ < /dev/null
Fetch the out-of-sync data from the current master. The same parameters should be
used here as used in replication:
sudo -u postgres /usr/pgsql-9.4/bin/pg_rewind \
  --target-pgdata=/pgsql/data/ \
  --source-server="host=10.1.1.100 user=standby_repl"
The server is now safe to start as a slave. Remove the lock file and tell Pacemaker to
retry starting it:
rm -f /pgsql/tmp/PGSQL.lock
pcs resource cleanup pgsql
Summary
In this chapter, you learned how to set up a PostgreSQL HA cluster using tools that
are commonly available in Linux. In addition to that, the most frequent issues, such
as split-brain and fencing, were discussed in great detail.
Working with PgBouncer
When you are working with a large-scale installation, it is quite likely that you
will have to deal with many concurrent open connections. Nobody will put
up 10 servers to serve just two concurrent users; in many cases, this simply makes
no sense. A large installation will usually have to deal with hundreds, or even
thousands, of concurrent connections. Introducing a connection pooler, such as
PgBouncer, will help improve performance of your systems. PgBouncer is really
the classic workhorse when it comes to connection pooling. It is rock solid and it
has already been around for many years.
In this chapter, we will take a deep look at PgBouncer and see how it can be
installed and used to speed up our installation. This chapter is not meant to be
a comprehensive guide to PgBouncer, and it can in no way replace the official
documentation (https://pgbouncer.github.io/overview.html).
PgBouncer solves this problem by placing itself between the actual database server
and the heavily used application. To the application, PgBouncer looks just like
a PostgreSQL server. Internally, PgBouncer will simply keep an array of open
connections and pool them. Whenever a connection is requested by the application,
PgBouncer will take the request and assign a pooled connection. In short, it is some
sort of proxy.
The main advantage here is that PgBouncer can quickly provide a connection to
the application, because the real database connection already exists behind the
scenes. In addition to this, a very low memory footprint can be observed. Users
have reported a footprint that is as small as 2 KB per connection. This makes
PgBouncer ideal for very large connection pools.
Installing PgBouncer
Before we dig into the details, we will see how PgBouncer can be installed. Just as
with PostgreSQL, you can take two routes. You can either install binary packages
or simply compile from source. In our case, we will show you how a source
installation can be performed:
1. The first thing you have to do is download PgBouncer from the official
website at http://pgfoundry.org/projects/pgbouncer.
Chapter 8
2. Once you have downloaded the .tar archive, you can safely unpack it
using the following command:
tar xvfz pgbouncer-1.5.4.tar.gz
3. Once you are done with extracting the package, you can enter the newly
created directory and run configure. Expect this to fail due to missing
packages. In many cases, you have to install libevent (the development
package) first before you can successfully run configure.
4. Once you have successfully executed configure, you can move forward
and run make. This will compile the source and turn it into binaries. Once
this is done, you can finally switch to root and run make install to deploy
those binaries. Keep in mind that the build process needs PostgreSQL server
development packages, as access to pg_config (a tool used to configure
external software) is needed.
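Condensed into commands, the steps above look roughly like this (the version number matches the archive used in the text; adjust paths for your system):

```shell
tar xvfz pgbouncer-1.5.4.tar.gz
cd pgbouncer-1.5.4
./configure
make
# switch to root (or use sudo) for the installation step
sudo make install
```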
[pgbouncer]
logfile = /var/log/pgbouncer.log
pidfile = /var/log/pgbouncer.pid
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = trust
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = session
server_reset_query = DISCARD ALL
max_client_conn = 100
default_pool_size = 20
Using the same database name is not required here. You can map any database
name to any connect strings. We have just found it useful to use identical names.
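For illustration, here is a [databases] section in which a virtual name is mapped to a differently named real database (all names here are made up):

```ini
[databases]
; clients connect to "p1"; PgBouncer forwards to the real database "prod1"
p1 = host=localhost port=5432 dbname=prod1
p0 = host=localhost port=5432 dbname=prod0
```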
Once we have written this config file, we can safely start PgBouncer and see
what happens:
hs@iMac:/etc$ pgbouncer bouncer.ini
2013-04-25 17:57:15.992 22550 LOG File descriptor limit: 1024 (H:4096),
max_client_conn: 100, max fds possible: 150
2013-04-25 17:57:15.994 22550 LOG listening on 127.0.0.1:6432
2013-04-25 17:57:15.994 22550 LOG listening on unix:/tmp/.s.PGSQL.6432
2013-04-25 17:57:15.995 22550 LOG process up: pgbouncer 1.5.4, libevent
2.0.16-stable (epoll), adns: evdns2
In production, you would configure authentication first, but let's do it step by step.
Dispatching requests
The first thing we have to configure when dealing with PgBouncer is the database
servers we want to connect to. In our example, we simply create links to p0 and
p1. We put the connect strings in, and they tell PgBouncer where to connect to. As
PgBouncer is essentially some sort of proxy, we can also map connections to make
things more flexible. In this case, mapping means that the database holding the
data does not necessarily have the same name as the virtual database exposed by
PgBouncer.
The following connection parameters are allowed: dbname, host, port, user,
password, client_encoding, datestyle, timezone, pool_size, and connect_query.
Everything up to the password is what you would use in any PostgreSQL
connect string. The rest is used to adjust PgBouncer for your needs. The most
important setting here is the pool size, which defines the maximum number of
connections allowed to this very specific PgBouncer virtual database.
Note that the size of the pool is not necessarily related to the number of connections
to PostgreSQL. There can be more than just one pending connection to PgBouncer
waiting for a connection to PostgreSQL.
The important thing here is that you can use PgBouncer to relay to various different
databases on many different hosts. It is not necessary that all databases reside on the
same host, so PgBouncer can also help centralize your network configuration.
Sometimes, you simply don't want to list all database connections. Especially if
there are many databases, this can come in handy. The idea is to direct all requests
that have not been listed before to the fallback server:
* = host=fallbackserver
In our setup, PgBouncer will produce a fair amount of log entries. To channel this
log into a log file, we have used the logfile directive in our config. It is highly
recommended to write log files to make sure that you can track all the relevant
data going on in your bouncer.
max_client_conn
The max_client_conn setting has already been described briefly. The theoretical
maximum used is as follows:
max_client_conn + (max_pool_size * total_databases * total_users)
If a database user is specified in the connect string (all users connect under the same
username), the theoretical maximum is this:
max_client_conn + (max_pool_size * total_databases)
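Plugging hypothetical values into the first formula, using the settings from our sample config (max_client_conn = 100, pool size 20) and assuming two databases and one user:

```shell
max_client_conn=100
max_pool_size=20
total_databases=2
total_users=1
# theoretical maximum number of connections
echo $(( max_client_conn + max_pool_size * total_databases * total_users ))
# prints 140
```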
default_pool_size
The default_pool_size setting specifies the default number of connections
allowed for each user/database pair.
min_pool_size
The min_pool_size setting will make sure that new connections to the databases
are added as soon as the number of spare connections drops below the threshold
due to inactivity. If the load spikes rapidly, min_pool_size is vital to keep the
performance up.
reserve_pool_size
The reserve_pool_size setting specifies the number of additional connections
that can be allowed to a pool.
pool_size
The pool_size parameter sets the maximum pool size for this database. If the
variable is not set, the default_pool_size parameter is used.
Authentication
Once we have configured our databases and other basic settings, we can turn
our attention to authentication. As you have already seen, this local config points
to the databases in your setup. All applications will point to PgBouncer, so all
authentication-related processing will actually be handled by the bouncer. How
does it work? Well, PgBouncer accepts the same authentication methods supported
by PostgreSQL, such as md5 (the auth_file may contain passwords encrypted with
MD5), crypt (passwords encrypted with crypt(3) in auth_file), plain (clear-text
passwords), trust (no authentication), and any (which is like trust, but
ignores usernames).
The first column holds the username, then comes a tab, and finally there is either
a plain text string or a password encrypted with MD5. If you want to encrypt a
password, PostgreSQL offers the following command:
test=# SELECT md5('abc');
md5
----------------------------------
900150983cd24fb0d6963f7d28e17f72
(1 row)
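Note that, depending on the PgBouncer version, md5 entries in userlist.txt are expected in PostgreSQL's own format: the string md5 followed by the MD5 hash of the password concatenated with the username. A sketch using standard shell tools (user hs and password abc are made-up examples):

```shell
# PostgreSQL-style md5 entry: "md5" + md5(password || username)
HASH=$(printf '%s' "abchs" | md5sum | awk '{print $1}')
printf '"hs" "md5%s"\n' "$HASH"
```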
Connecting to PgBouncer
Once we have written this basic config and started up the system, we can safely
connect to one of the databases listed:
hs@iMac:~$ psql -p 6432 p1 -U hs
psql (9.4.4)
Type "help" for help.
p1=#
In our example, we are connecting to the database called p1. We can see that the shell
has been opened normally, and we can move on and issue the SQL we want just as if
we were connected to a normal database.
The log file will also reflect our efforts to connect to the database and state:
2013-04-25 18:10:34.830 22598 LOG C-0xbca010: p1/hs@unix:6432 login
attempt: db=p1 user=hs
2013-04-25 18:10:34.830 22598 LOG S-0xbe79c0: p1/hs@127.0.0.1:5432 new
connection to server
For each connection, we get various log entries so that an administrator can easily
check what is going on.
Java issues
If you happen to use Java as the frontend, there are some points that have to be
taken into consideration. Java tends to pass some parameters to the server as part
of the connection string. One of those parameters is extra_float_digits. This
postgresql.conf parameter governs the floating-point behavior of PostgreSQL,
and is set by Java to make things more deterministic.
The problem is that PgBouncer will only accept the tokens listed in the previous
section; otherwise, it will error out.
To get around this issue, you can add a directive to your bouncer config (the
PgBouncer section of the file):
ignore_startup_parameters = extra_float_digits
This will ignore the JDBC setting and allow PgBouncer to handle the connection
normally. If you want to use Java, we recommend putting those parameters into
postgresql.conf directly to make sure that no nasty issues will pop up
during production.
Pool modes
In the configuration, you might have also seen a config variable called pool_mode,
which has not been described yet. The reason is that pool modes are so important
that we have dedicated an entire section to them. The following pool modes are available:
• session
• transaction
• statement
The session mode is the default mode of PgBouncer. A connection will go back to
the pool as soon as the application disconnects from the bouncer. In many cases, this
is the desired mode because we simply want to save on connection overhead—and
nothing more.
However, in some cases, it might be useful to return sessions to the pool faster.
This is especially important if there are lags between various transactions. In the
transaction mode, PgBouncer will immediately return a connection to the pool at
the end of a transaction (and not when the connection ends). The advantage of this is
that we can still enjoy the benefits of transactions, but connections are returned much
sooner, and therefore, we can use those open connections more efficiently. For most
web applications, this can be a big advantage because the lifetime of a session has to
be very short. In the transaction mode, some commands, such as SET, RESET, LOAD,
and so on, are not fully supported due to side effects.
Most people will stick to the default mode here, but you have to keep in mind that
other options exist.
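Switching modes is a one-line change in the [pgbouncer] section of the config, for example:

```ini
; return connections to the pool at transaction end instead of session end
pool_mode = transaction
```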
Cleanup issues
One advantage of a clean and fresh connection after PostgreSQL calls fork() is the
fact that the connection does not contain any faulty settings, open cursors, or other
leftovers whatsoever. This makes a fresh connection safe to use and avoids side
effects of other connections.
As you have learned in this chapter, PgBouncer will reuse connections to avoid those
fork() calls. The question now is, "How can we ensure that some application does
not suffer from side effects caused by some other connection?"
Keep in mind that there is no need to run an explicit ROLLBACK command before a
connection goes back to the pool or after it is fetched from the pool. Rolling back
transactions is already done by PgBouncer automatically, so you can be perfectly
sure that a connection is never inside a transaction.
Improving performance
Performance is one of the key factors when considering PgBouncer in the first place.
To make sure that performance stays high, some issues have to be taken seriously.
First of all, it is recommended to make sure that all nodes participating in your
setup are fairly close to each other. This greatly helps reduce network round trip
times, and thus boosts performance. There is no point in reducing the overhead of
calling fork() and paying for this gain with network time. Just as in most scenarios,
reducing network time and latency is definitely a huge benefit.
A simple benchmark
In this chapter, we have already outlined that it is very beneficial to use PgBouncer
if many short-lived connections have to be created by an application. To prove our
point, we have compiled an extreme example. The goal is to run a test doing as little
as possible. We want to measure merely how much time we spend in opening a
connection. To do so, we have set up a virtual machine with just one CPU.
The test itself will be performed using pgbench (a contribution module widely used
to benchmark PostgreSQL).
Then we have to write ourselves a nice sample SQL command that should be
executed repeatedly:
SELECT 1;
Now we can run an extreme test against our standard PostgreSQL installation:
hs@VM:test$ pgbench -t 1000 -c 20 -S p1 -C -f select.sql
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 20
number of threads: 1
number of transactions per client: 1000
number of transactions actually processed: 20000/20000
tps = 67.540663 (including connections establishing)
tps = 15423.090062 (excluding connections establishing)
We want to run 20 concurrent connections. They all execute 1,000 single transactions.
The -C option indicates that after every single transaction, the benchmark will close
the open connection and create a new one. This is a typical case on a web server
without pooling—each page might be a separate connection.
Now keep in mind that this test has been designed to look ugly. We may observe
that keeping the connection alive will make sure that we can execute roughly 15,000
transactions per second on our single VM CPU. If we have to fork a connection each
time, we will drop to just 67 transactions per second, as we have stated before.
This kind of overhead is worth thinking about.
Let's now repeat the test and connect to PostgreSQL through PgBouncer:
hs@VM:test$ pgbench -t 1000 -c 20 -S p1 -C -f select.sql -p 6432
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 20
number of threads: 1
number of transactions per client: 1000
number of transactions actually processed: 20000/20000
tps = 1013.264853 (including connections establishing)
tps = 2765.711593 (excluding connections establishing)
As you can see, our throughput has risen to 1,013 transactions per second. This is 15
times more than before—indeed a nice gain.
However, we also have to see whether our performance level has dropped since we
did not close the connection to PgBouncer. Remember that the bouncer, the benchmark
tool, and PostgreSQL are all running on the same single CPU. This does have an
impact here (context switches are not too cheap in a virtualized environment).
Keep in mind that this is an extreme example. If you repeat the same test with longer
transactions, you will see that the gap logically becomes much smaller. Our example
has been designed to demonstrate our point.
Maintaining PgBouncer
In addition to what we have already described in this chapter, PgBouncer has a nice
interactive administration interface capable of performing basic administration and
monitoring tasks.
How does it work? PgBouncer provides you with a fake database called pgbouncer.
It cannot be used for queries, as it only provides a simple syntax for handling basic
administrative tasks.
If you are using PgBouncer, please don't use a normal database called
pgbouncer. It will just amount to confusion and yield zero benefits.
Don't worry about the warning message; it is just telling us that we have connected
to something that does not look like a native PostgreSQL 9.4 database instance.
RELOAD
PAUSE [<db>]
RESUME [<db>]
KILL <db>
SUSPEND
SHUTDOWN
SHOW
pgbouncer=# \x
Expanded display is on.
pgbouncer=# SHOW DATABASES;
-[ RECORD 1 ]+----------
name | p0
host | localhost
port | 5432
database | p0
force_user |
pool_size | 20
reserve_pool | 0
-[ RECORD 2 ]+----------
name | p1
host | localhost
port | 5432
database | p1
force_user |
pool_size | 20
reserve_pool | 0
-[ RECORD 3 ]+----------
name | pgbouncer
host |
port | 6432
database | pgbouncer
force_user | pgbouncer
pool_size | 2
reserve_pool | 0
As you can see, we have two productive databases and the virtual pgbouncer
database. What is important to see here is that the listing contains the pool size as
well as the size of the reserved pool. It is a good check to see what is going on in
your bouncer setup.
Once you have checked the list of databases on your system, you can turn your
attention to the active clients in your system. To extract the list of active clients,
PgBouncer offers the SHOW CLIENTS instruction:
pgbouncer=# \x
Expanded display is on.
pgbouncer=# SHOW CLIENTS;
-[ RECORD 1 ]+--------------------
type | C
user | zb
database | pgbouncer
state | active
addr | unix
port | 6432
local_addr | unix
local_port | 6432
connect_time | 2013-04-29 11:08:54
request_time | 2013-04-29 11:10:39
ptr | 0x19e3000
link |
At the moment, we have exactly one user connection to the pgbouncer database. We
can see nicely where the connection comes from and when it has been created. SHOW
CLIENTS is especially important if there are hundreds, or even thousands, of
connections on the system.
Sometimes, it can be useful to extract aggregated information from the system. The
SHOW STATS command will provide you with statistics about what is going on in
your system. It shows how many requests have been performed and how many
queries have been executed on average:
pgbouncer=# SHOW STATS;
-[ RECORD 1 ]----+----------
database | pgbouncer
total_requests | 3
total_received | 0
total_sent | 0
total_query_time | 0
avg_req | 0
avg_recv | 0
avg_sent | 0
avg_query | 0
Finally, we can take a look at the memory consumption we are facing. PgBouncer
will return this information if SHOW MEM is executed:
pgbouncer=# SHOW MEM;
name | size | used | free | memtotal
--------------+------+------+------+----------
user_cache | 184 | 4 | 85 | 16376
db_cache | 160 | 3 | 99 | 16320
pool_cache | 408 | 1 | 49 | 20400
server_cache | 360 | 0 | 0 | 0
client_cache | 360 | 1 | 49 | 18000
iobuf_cache | 2064 | 1 | 49 | 103200
(6 rows)
As you can see, PgBouncer is really lightweight and does not consume as much
memory as other connection pools do.
The RELOAD command will reread the config, so there will be no need to restart the
entire PgBouncer tool for most small changes. This is especially useful if there is
just a new user.
An additional feature of PgBouncer is the ability to stop operations for a while. But
why would anybody want to stop queries for some time? Well, let's assume you
want to perform a small change somewhere in your infrastructure. Just interrupt
operations briefly without actually throwing errors. Of course, you have to be a
little careful to make sure that your frontend infrastructure can handle such an
interruption nicely. From the database side, however, it can come in handy.
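The command for this is PAUSE, issued on the same interactive console; by analogy with the RESUME example that follows, a session might look like this (sketched, not captured from a live system):

```
pgbouncer=# PAUSE;
PAUSE
```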
Once you are done with your changes, you can resume normal operations easily:
pgbouncer=# RESUME;
RESUME
Once this has been called, you can continue sending queries to the server.
Finally, you can even stop PgBouncer entirely from the interactive shell, but it is
highly recommended that you be careful when doing this:
pgbouncer=# SHUTDOWN;
The connection to the server was lost. Attempting reset: Failed.
!>
Summary
In this chapter, you learned how to use PgBouncer for highly scalable web applications
to reduce the overhead of permanent creation of connections. We saw how to
configure the system and how we can utilize a virtual management database.
In the next chapter, you will be introduced to pgpool, a tool used to perform
replication and connection pooling. Just like PgBouncer, pgpool is open source
and can be used along with PostgreSQL to improve your cluster setups.
Working with pgpool
In the previous chapter, we looked at PgBouncer and learned how to use it to
optimize replicated setups as much as possible. In this chapter, we will take a look
at a tool called pgpool. The idea behind pgpool is to bundle connection pooling with
some additional functionality to improve replication, load balancing, and so on. It
has steadily been developed over the years and can be downloaded freely from the
www.pgpool.net website.
Installing pgpool
Just as we have seen for PostgreSQL, PgBouncer and most other tools covered in
this book, we can either install pgpool from source or just use a binary. Again, we
will describe how the code can be compiled from source. The source archive can be
downloaded from http://www.pgpool.net/download.php?f=pgpool-II-3.4.2.tar.gz.
The installation procedure is just like we have seen already. The first thing we have
to call is configure along with some parameters. In our case, the main parameter
is --with-pgsql, which tells the build process where to find our PostgreSQL
installation.
$ ./configure --with-pgsql=/usr/local/pgsql/
The same applies to insert_lock.sql, which can be found in the sql directory
of the pgpool source code. The easiest solution is to load this into template1
directly as follows:
psql -f insert_lock.sql template1
Chapter 9
First, cd into the pgpool recovery directory. Then the sources can be compiled:
make
make install
Once the module has been installed, a simple postgresql.conf parameter has
to be added as follows:
pgpool.pg_ctl = '/usr/local/pgsql/bin/pg_ctl'
• Connection pooling
• Statement-level replication
• Load balancing
• Limiting connections
• In-memory caching
One core feature of pgpool is the ability to do connection pooling. The idea is pretty
much the same as the one we outlined in the previous chapter. We want to reduce the
impact of forking connections each and every time a webpage is opened. Instead, we
want to keep connections open and reuse them whenever a new request comes along.
Those concepts have already been discussed for PgBouncer in the previous chapter.
One feature often requested along with replication is load balancing, and pgpool
offers exactly that. You can define a set of servers and use the pooler to dispatch
requests to the desired database nodes. It is also capable of sending a query to
the node with the lowest load.
To boost performance, pgpool offers a query cache. The goal of this mechanism is to
reduce the number of queries that actually make it to the real database servers—as
many queries as possible should be served by the cache. We will take a closer look
at this topic in this chapter.
Older versions of pgpool had a feature called "Parallel query." The idea was to
parallelize a query to make sure that it scales out better. However, this feature
has fortunately been removed. It caused a lot of maintenance work and created
some false expectations on the user side as far as how things work.
Once you have understood the overall architecture as it is from a user's point
of view, we can dig into a more detailed description:
When pgpool is started, we fire up the pgpool parent process. This process will fork
and create the so-called child processes. These processes will be in charge of serving
requests to end users and handle all the interaction with our database nodes. Each
child process will handle a couple of pool connections. This strategy will reduce the
number of authentication requests to PostgreSQL dramatically.
It is a lot easier to just adapt this config file than to write things from scratch. In
the following listing, you will see a sample config that you can use for a simple
two-node setup:
listen_addresses = '*'
port = 9999
socket_dir = '/tmp'
pcp_port = 9898
pcp_socket_dir = '/tmp'
backend_hostname0 = 'localhost'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/home/hs/db'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'localhost'
backend_port1 = 5433
backend_weight1 = 1
backend_data_directory1 = '/home/hs/db2'
backend_flag1 = 'ALLOW_TO_FAILOVER'
enable_pool_hba = off
pool_passwd = 'pool_passwd'
authentication_timeout = 60
ssl = off
num_init_children = 32
max_pool = 4
child_life_time = 300
child_max_connections = 0
connection_life_time = 0
client_idle_limit = 0
connection_cache = on
reset_query_list = 'ABORT; DISCARD ALL'
replication_mode = on
replicate_select = off
insert_lock = on
load_balance_mode = on
ignore_leading_white_space = on
white_function_list = ''
black_function_list = 'nextval,setval'
Let us now discuss these settings in detail and see what each setting means:
• load_balance_mode: Shall pgpool split the load across all hosts in the system? The default setting is false.
• ignore_leading_white_space: Shall leading whitespace of a query be ignored or not?
• white_function_list: When pgpool runs a stored procedure, it has no idea what the procedure actually does. A SELECT func() call can be a read or a write; there is no way to tell from the outside what will actually happen. The white_function_list parameter allows you to teach pgpool which functions can safely be load balanced. If a function writes data, it must not be load balanced; otherwise, the data will get out of sync on those servers. Being out of sync must be avoided at all costs.
• black_function_list: This is the opposite of the white_function_list
parameter. It will tell pgpool which functions must be replicated to make
sure that things stay in sync.
Keep in mind that there is an important relation between max_pool and the child processes of pgpool: a single child process can handle up to max_pool pooled connections.
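Because each child caches its own pool, the total number of backend connections pgpool may open is the product of these two settings. The little sketch below is just illustrative arithmetic using the values from the sample configuration above; on the PostgreSQL side, max_connections should be at least this large:

```shell
# Values taken from the sample pgpool.conf shown above
num_init_children=32   # number of preforked pgpool child processes
max_pool=4             # connection pools cached per child process

# Each child can cache up to max_pool backend connections, so the
# PostgreSQL server must be prepared for up to this many connections:
echo $(( num_init_children * max_pool ))
```

With the sample values, this yields 128 potential backend connections, which is worth checking against max_connections in postgresql.conf.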
Password authentication
Once you have come up with a working config for pgpool, you can move ahead and configure authentication. In our case, we want to add one user called hs. The password of hs should simply be hs. The pool_passwd variable will be in charge of storing passwords. The format of the file is simple; it will hold the name of the user, a colon, and the MD5-hashed password.
Then, we can add all of this to the config file storing users and passwords. In the
case of pgpool, this file is called pcp.conf:
# USERID:MD5PASSWD
hs:789406d01073ca1782d86293dcfc0764
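The hash for such an entry can be generated with the pg_md5 utility that ships with pgpool (pg_md5 hs). If pg_md5 is not at hand, any MD5 tool produces the same kind of 32-character hex digest; the snippet below is only a sketch using md5sum, with the username hs and password hs from the example above:

```shell
# Build a pcp.conf entry for user "hs" with password "hs".
# pgpool's own tool would be:  pg_md5 hs
# Here md5sum serves as a stand-in; both yield a 32-char hex digest.
user=hs
password=hs
hash=$(printf '%s' "$password" | md5sum | awk '{print $1}')
echo "$user:$hash"
```

The resulting line can be appended to pcp.conf directly.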
If there is no error, we should see a handful of processes waiting for some work from
those clients out there:
$ ps ax | grep pool
30927 pts/4 S+ 0:00 pgpool -n
30928 pts/4 S+ 0:00 pgpool: wait for connection request
30929 pts/4 S+ 0:00 pgpool: wait for connection request
30930 pts/4 S+ 0:00 pgpool: wait for connection request
30931 pts/4 S+ 0:00 pgpool: wait for connection request
30932 pts/4 S+ 0:00 pgpool: wait for connection request
As you can clearly see, pgpool will show up as a handful of processes in the
process table.
Attaching hosts
Basically, we could already connect to pgpool and fire queries, but this would instantly lead to disaster and inconsistency. Before we can move on to some real
action, we should check the status of those nodes participating in the cluster. To
do so, we can use a tool called pcp_node_info:
$ pcp_node_info 5 localhost 9898 hs hs 0
localhost 5432 3 0.500000
$ pcp_node_info 5 localhost 9898 hs hs 1
localhost 5433 2 0.500000
The format of this call to pcp_node_info is a little complicated and not too easy to
read if you happen to see it for the first time.
Note that the weights are 0.5 here. In the configuration, we have given both backends
a weight of 1. The pgpool tool has automatically adjusted the weight so that they add
up to 1.
The first parameter is the timeout parameter. It will define the maximum time for
the request. Then, we specify the host and the port of the PCP infrastructure. Finally,
we pass a username and a password as well as the number of the host we want to
have information about. The system will respond with a hostname, a port, a status,
and the weight of the node. In our example, we have to focus our attention on the
status column. It can return four different values:
• 0: This state is only used during the initialization. PCP will never display it.
• 1: Node is up. No connections yet.
• 2: Node is up. Connections are pooled.
• 3: Node is down.
In our example, we can see that node number 0 is returning status 3; it is down. This is clearly a problem, because if we executed a write now, it would not end up on both nodes but in just one of them.
To fix the problem, we can call pcp_attach_node and enable the node:
$ pcp_attach_node 5 localhost 9898 hs hs 0
$ pcp_node_info 5 localhost 9898 hs hs 0
localhost 5432 1 0.500000
Once we have added the node, we can check its status again. It will be up and running.
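Since the bare status code in pcp_node_info's output is easy to misread, a tiny wrapper that translates it into text can save time. This is only a sketch; the node_status helper name is an invention for illustration, and the mapping follows the four states listed above:

```shell
# Translate a pcp_node_info status code into a readable label.
node_status() {
    case "$1" in
        0) echo "initializing" ;;             # only used during startup, never shown by PCP
        1) echo "up, no connections yet" ;;
        2) echo "up, connections pooled" ;;
        3) echo "down" ;;
        *) echo "unknown" ;;
    esac
}

# Example: feed it the third field of pcp_node_info's output, e.g.
#   status=$(pcp_node_info 5 localhost 9898 hs hs 0 | awk '{print $3}')
node_status 3
```

Looping over the node numbers and printing such labels gives a quick health overview of the whole cluster.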
To test our setup, we can check out psql and display a list of all the databases
in the system:
$ psql -l -p 9999
List of databases
Doing this basic check is highly recommended to make sure that nothing has been
forgotten and everything has been configured properly.
One more thing that can be highly beneficial when it comes to checking a running setup is the pcp_pool_status command. It extracts information about the current setup and shows the configuration parameters currently in use.
The syntax of this command is basically the same as for all the pcp_* commands we
have seen so far:
$ pcp_pool_status 5 localhost 9898 hs hs
name : listen_addresses
value: localhost
name : port
value: 9999
desc : pgpool accepting port number
...
In addition to this, we suggest performing the usual checks such as checking for
open ports and properly running processes. These checks should reveal if anything
of importance has been forgotten during the configuration.
In fact, it can even be beneficial to do so, because you don't have to worry about side effects of functions or other potential issues. The PostgreSQL transaction log is always right, and it can be considered the ultimate law.
The pgpool statement-level replication was a good feature to replicate data before
streaming replication was introduced into the core of PostgreSQL.
In addition to this, it can be beneficial to have just one master. The reason is simple: with just one master, inconsistencies are hard to run into. Also, pgpool will create full replicas, so the data has to be replicated anyway. There is no win if the data must end up on both servers anyway; writing to two nodes will not make things scale any better in this case.
How can you run pgpool without replication? The process is basically quite simple. In a basic setup, pgpool will assume that node number 0 is the master, so you have to make sure that your two nodes are listed in the config in the right order. For a basic setup, these small changes to the config file are perfectly fine.
Consider the following sequence of events:
1. The user clicks on the Save button.
2. A write request comes in; pgpool dispatches it to node 0 because we are facing a write.
3. The user reaches the next page, and a read request is issued.
This can lead to strange behavior on the client's side. A typical case of strange behavior would be this: a user creates a profile. In this case, a row is written. At the very next moment, the user wants to visit his or her profile and checks the data. If he or she happens to read from a replica, the data might not be there yet. If you are writing a web application, you must keep this in the back of your mind.
The delay_threshold variable defines the maximum lag a slave is allowed to have and still receive reads. The setting is defined in bytes of changes inside the XLOG. So, if you set this to 1024, a slave is only allowed to be 1 KB of XLOG behind the master; otherwise, it will not receive read requests.
Of course, unless replication is fully synchronous, it is pretty hard to make it totally impossible for a slave to ever return data that is too old; however, a reasonable setting can make it very unlikely. In many practical applications, this may be sufficient.
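Because delay_threshold is measured in bytes of XLOG, it can help to convert PostgreSQL's X/Y transaction log positions into plain byte offsets when choosing a value. The helper below is only a sketch; the function name is an invention, and the 2^32 multiplier for the high word matches how PostgreSQL encodes XLOG locations:

```shell
# Convert an XLOG position such as '0/3000060' into a byte offset.
xlog_to_bytes() {
    hi=${1%%/*}                                   # high 32 bits (hex)
    lo=${1##*/}                                   # low 32 bits (hex)
    echo $(( 16#$hi * 4294967296 + 16#$lo ))
}

# Lag in bytes between master and slave positions, e.g. the values
# reported by pg_current_xlog_location() and the slave's replay
# location; this is the number compared against delay_threshold.
master=0/3000060
slave=0/3000000
echo $(( $(xlog_to_bytes $master) - $(xlog_to_bytes $slave) ))
```

In this example, the slave is 96 bytes behind; with delay_threshold = 1024, it would still receive reads.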
How does pgpool know how far a slave is behind? The answer is that this can be configured easily.
If you really want to make sure that load balancing will always
provide you with up-to-date data, it is necessary to replicate
synchronously, which can be expensive.
The pgpool tool can be configured to do load balancing and automatically send
write requests to the first and read requests to the second node.
What happens in case of failover? Let us assume that the master will crash. In this
case, Linux HA would trigger the failover and move the service IP of the master to
the slave. The slave can then be promoted to be the new master by Linux HA (if this
is desired). The pgpool tool would then simply face a broken database connection
and start over and reconnect.
Of course, we can also use Londiste or some other technology such as Slony to
replicate data. However, for the typical case, streaming replication is just fine.
Slony and SkyTools are perfect tools if you want to upgrade all nodes
inside your pgpool setup with a more recent version of PostgreSQL.
You can build Slony or Londiste replicas and then just suspend
operations briefly (to stay in sync) and switch your IP to the host
running the new version.
Practical experience has shown that using PostgreSQL's onboard tools and operating system-level tools is a good way to handle failovers easily and, more importantly, reliably.
The first thing you have to do is to add the failover command to your pool
configuration. Here is an example:
failover_command = '/usr/local/bin/pgpool_failover_streaming.sh %d %H'
Whenever pgpool detects a node failure, it will execute the script we defined in the
pool configuration and react according to our specifications. Ideally, this failover
script will write a trigger file; this trigger file can then be seen by the slave system
and turn it into a master.
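A minimal failover helper along the lines just described could look as follows. This is only a sketch: the failover_action name, the trigger file path, and the assumption that node 0 is the master are all inventions matching this example setup; pgpool would pass %d (ID of the failed node) and %H (hostname of the new master) as arguments:

```shell
# Decide what the failover_command script should do for a given failure.
# Prints the command to run, or nothing if no promotion is needed.
failover_action() {
    failed_node=$1                          # %d: ID of the node that failed
    new_master=$2                           # %H: hostname of the node to promote
    trigger_file=/tmp/postgresql.trigger    # must match trigger_file in recovery.conf

    # Only the loss of the master (node 0 in this setup) needs a
    # promotion; a failed slave is simply dropped from the pool.
    if [ "$failed_node" = "0" ]; then
        echo "ssh $new_master touch $trigger_file"
    fi
}

failover_action 0 localhost
```

A real script would execute the printed command instead of echoing it; printing it first is a convenient way to dry-run the logic.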
The trigger_file is checked every 5 seconds. Once the failover occurs, pgpool can treat the second server as the new master.
The logical next step after a failover is to bring a new server back into the system. The easiest and most robust way of doing that is to take a fresh base backup from the new master, restore the failed node from it, and then reattach the node using pcp_attach_node.
Summary
The pgpool tool has been widely adopted for replication and failover. It offers a vast variety of features, including load balancing, connection pooling, and replication. pgpool replicates data at the statement level and can also integrate with PostgreSQL's onboard features such as streaming replication.
In the next chapter, we will dive into Slony and learn about logical replication. We
will discuss the software architecture and see how Slony can be used to replicate
data within a large server farm.
Configuring Slony
Slony is one of the most widespread replication solutions in the field of PostgreSQL. It is not only one of the oldest replication implementations, but also the one with the most extensive support from external tools, such as pgAdmin3. For many years, Slony was the only viable solution for the replication of data in PostgreSQL. Fortunately, the situation has changed, and a handful of new technologies have emerged.
In this chapter, we will take a deep look at Slony, and you will learn how to integrate
Slony into your replication setup. You will also find out which problems you can
solve using Slony. A wide variety of use cases and examples will be covered.
• Installing Slony
• The Slony system architecture
• Replicating tables
• Deploying DDLs
• Handling failovers
• Common pitfalls
Installing Slony
To install Slony, first download the most recent tarball from http://slony.info. As always, we will perform a source installation so that every step of the process is visible.
For the purpose of this chapter, we have used the version of Slony available at
http://main.slony.info/downloads/2.2/source/slony1-2.2.4.tar.bz2.
Once the tarball has been extracted, we can move forward and compile the code. To build the code, we must tell Slony where to look for pg_config. The purpose of pg_config is to provide the add-on module with all of the information about the build process of PostgreSQL itself. In this way, we can ensure that PostgreSQL and Slony are compiled in the same way. The pg_config tool is quite common and is widely used by many tools built on PostgreSQL. In our demo setup, PostgreSQL resides in /usr/local/pgsql, so we can safely assume that pg_config resides in the bin directory of the installation:
./configure --with-pgconfigdir=/usr/local/pgsql/bin
make
Once we have executed configure, we can compile the code by calling make. Then we can switch to root (if PostgreSQL has been installed as root) and install the binaries in their final destination. The process is again similar to that for other external tools built on PostgreSQL.
1. The slon daemon will come along and read all changes since the last commit.
2. All changes will be replayed on the slaves.
3. Once this is done, the log can be deleted.
Keep in mind that the transport protocol is pure text. The main advantage here is
that there is no need to run the same version of PostgreSQL on every node in the
cluster, because Slony will abstract the version number. We cannot achieve this with
transaction log shipping because, in the case of XLOG-based replication, all nodes in
the cluster must use the same major version of PostgreSQL. The downside to Slony is
that data replicated to a slave has to go through an expensive parsing step. In addition
to that, polling sl_log_* and firing triggers is expensive. Therefore, Slony certainly
cannot be considered a high-performance replication solution.
The change log is written for certain tables. This also means that we
don't have to replicate all of those tables at the same time. It is possible
to replicate just a subset of those tables on a node.
To make this work, we have to run exactly one slon daemon per database in
our cluster.
Note that we are talking about one slon daemon per database, not
per instance. This is important to take into account when doing the
actual setup.
As each database will have its own slon daemon, these processes will communicate
with each other to exchange and dispatch data. Individual slon daemons can also
function as relays and simply pass data. This is important if you want to replicate
data from database A to database C through database B. The idea here is similar to
what is achievable using streaming replication and cascading replicas.
An important thing about Slony is that there is no need to replicate an entire instance
or an entire database. Replication is always related to a table or a group of tables. For
each table (or for each group of tables), one database will serve as the master, while
as many databases as desired will serve as slaves for that particular set of tables.
It might very well be the case that one database node is the master of tables A and B
and another database is the master of tables C and D. In other words, Slony allows
the replication of data back and forth. Which data has to flow, from where, and to
where will be managed by the slon daemon.
The slon daemon itself consists of various threads serving different purposes,
such as cleanup, listening for events, and applying changes to a server. In addition
to these, it will perform synchronization-related tasks.
To interface with the slon daemon, you can use the command-line tool called
slonik. It will be able to interpret scripts and talk to Slony directly.
Creating the two databases should be an easy task once your instance is up
and running:
hs@hs-VirtualBox:~$ createdb db1
hs@hs-VirtualBox:~$ createdb db2
Now we can create a table that should be replicated from database db1 to
database db2:
db1=# CREATE TABLE t_test (id serial, name text,
PRIMARY KEY (id));
NOTICE: CREATE TABLE will create implicit sequence "t_test_id_seq" for
serial column "t_test.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "t_test_
pkey" for table "t_test"
CREATE TABLE
Create this table in both databases in an identical manner, because the table structure won't be replicated automatically.
Once this has been done, we can write a slonik script to tell the cluster about
our two nodes. The slonik interface is a command-line interface that we use to
talk to Slony directly. You can also work with it interactively, but this is far from
comfortable. In fact, it is really hard, and in the past we have rarely seen people
do this in production environments.
MASTERDB=db1
SLAVEDB=db2
HOST1=localhost
HOST2=localhost
DBUSER=hs
slonik<<_EOF_
cluster name = first_cluster;
# init cluster
The first thing we have to do is define a cluster name. This is important: with Slony,
a cluster is more of a virtual thing, and it is not necessarily related to physical
hardware. We will find out what this means later on, when talking about failovers.
The cluster name is also used to prefix schema names in Slony.
In the next step, we have to define the nodes of this cluster. The idea here is that each
node will have a number associated with a connection string. Once this has been done,
we can call init cluster. During this step, Slony will deploy all of the infrastructure
to perform replication. We don't have to install anything manually here.
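Pieced together, the initialization script outlined above could look like the sketch below. It is a sketch only: the node numbers, the admin conninfo strings, and the shell variables mirror the two-database example used throughout this chapter, and the exact statements should be checked against the Slony documentation for your version:

```shell
#!/bin/sh
MASTERDB=db1
SLAVEDB=db2
HOST1=localhost
HOST2=localhost
DBUSER=hs

slonik<<_EOF_
cluster name = first_cluster;

# admin conninfo strings: how slonik itself reaches each node
node 1 admin conninfo = 'dbname=$MASTERDB host=$HOST1 user=$DBUSER';
node 2 admin conninfo = 'dbname=$SLAVEDB host=$HOST2 user=$DBUSER';

# deploy the Slony infrastructure on node 1
init cluster (id = 1, comment = 'Master Node');

# register the second node
store node (id = 2, comment = 'Slave Node', event node = 1);

# define the paths between the nodes (A -> B and B -> A)
store path (server = 1, client = 2, conninfo = 'dbname=$MASTERDB host=$HOST1 user=$DBUSER');
store path (server = 2, client = 1, conninfo = 'dbname=$SLAVEDB host=$HOST2 user=$DBUSER');
_EOF_
```

Note how each direction gets its own store path entry, which is exactly the asymmetry discussed next.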
Now that the cluster has been initialized, we can organize our tables into replication
sets, which are really just a set of tables. In Slony, we will always work with
replication sets. Tables are grouped into sets and replicated together. This layer of
abstraction allows us to quickly move groups of tables around. In many cases, this
is a lot easier than just moving individual tables one by one.
Finally, we have to define paths. What is a path? A path is basically the connection
string for moving from A to B. The main question here is: why are paths needed
at all? We have already defined nodes earlier, so why define paths? The answer is
that the route from A to B is not necessarily the same as the route from B to A. This
is especially important if one of these servers is in some DMZ while the other one
is not. In other words, by defining paths, you can easily replicate between different
private networks and cross firewalls while performing some NAT, if necessary.
Slony has done some work in the background. By looking at our test table, we can
see what has happened:
db1=# \d t_test
Table "public.t_test"
Column | Type | Modifiers
--------+---------+-----------------------------------------------------
id | integer | not null default nextval('t_test_id_seq'::regclass)
name | text |
Indexes:
"t_test_pkey" PRIMARY KEY, btree (id)
Triggers:
    _first_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON t_test FOR EACH ROW EXECUTE PROCEDURE _first_cluster.logtrigger('_first_cluster', '1', 'k')
    _first_cluster_truncatetrigger BEFORE TRUNCATE ON t_test FOR EACH STATEMENT EXECUTE PROCEDURE _first_cluster.log_truncate('1')
Disabled triggers:
    _first_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON t_test FOR EACH ROW EXECUTE PROCEDURE _first_cluster.denyaccess('_first_cluster')
    _first_cluster_truncatedeny BEFORE TRUNCATE ON t_test FOR EACH STATEMENT EXECUTE PROCEDURE _first_cluster.deny_truncate()
Now that this table is under Slony's control, we can start replicating it. To do so,
we have to come up with a slonik script again:
#!/bin/sh
MASTERDB=db1
SLAVEDB=db2
HOST1=localhost
HOST2=localhost
DBUSER=hs
slonik<<_EOF_
cluster name = first_cluster;
After stating the cluster name and listing the nodes, we can call subscribe set.
The point here is that in our example, set number 1 is replicated from node 1 to node
2 (receiver). The forward keyword is important to mention here. This keyword
indicates whether or not the new subscriber should store the log information during
replication to make it possible for future nodes to be candidates for the provider.
Any node that is intended to be a candidate for failover must have forward =
yes. In addition to that, this keyword is essential to perform cascaded replication
(meaning A replicates to B and B replicates to C).
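The truncated listing above boils down to a subscribe set call. A sketch of the full statement, following the naming used in this chapter (set 1, provider node 1, receiver node 2), might look like this; the conninfo strings are assumptions matching the earlier examples:

```shell
slonik<<_EOF_
cluster name = first_cluster;
node 1 admin conninfo = 'dbname=db1 host=localhost user=hs';
node 2 admin conninfo = 'dbname=db2 host=localhost user=hs';

# node 2 subscribes to set 1; forward = yes keeps the log so that
# node 2 can later act as a provider or as a failover target
subscribe set (id = 1, provider = 1, receiver = 2, forward = yes);
_EOF_
```

Setting forward = no would make node 2 a pure leaf that can never hand the data on.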
If you execute this script, Slony will truncate the table on the slave and reload all of
the data to make sure that things are in sync. In many cases, you may already know
that you are in sync, and you would want to avoid copying gigabytes of data over
and over again. To achieve this, we can add OMIT COPY = yes. This will tell Slony
that we are sufficiently confident that the data is already in sync. This feature has
been supported since Slony 2.0.
After defining what we want to replicate, we can fire up those two slon daemons in
our favorite Unix shell:
$ slon first_cluster 'host=localhost dbname=db1'
$ slon first_cluster 'host=localhost dbname=db2'
This can also be done before we define this replication route, so the order is not of
primary concern here.
Now we can move forward and check whether the replication is working nicely:
db1=# INSERT INTO t_test (name) VALUES ('anna');
INSERT 0 1
db1=# SELECT * FROM t_test;
id | name
----+------
1 | anna
(1 row)
db1=# \q
hs@hs-VirtualBox:~/slony$ psql db2
psql (9.4.1)
Type "help" for help.
We add a row to the master, quickly disconnect, and query to check whether the
data is already there. If you happen to be quick enough, you will see that the data
comes with a small delay. In our example, we managed to get an empty table just
to demonstrate what asynchronous replication really means.
Let's assume you are running a book shop. Your application connects
to server A to create a new user. Then the user is redirected to a
new page, which queries some information about the new user—be
prepared for the possibility that the data is not there yet on server
B. This is a common mistake in many web applications dealing with
load balancing. The same kind of delay happens with asynchronous
streaming replication.
Deploying DDLs
Replicating just one table is clearly not enough for a productive application.
Also, there is usually no way to ensure that the data structure never changes.
At some point, it is simply necessary to deploy changes of the data structures
(so-called DDLs).
The problem now is that Slony relies heavily on triggers. A trigger can fire when
a row in a table changes. This works for all tables, but it does not work for system
tables. So, if you deploy a new table or happen to change a column, there is no way
for Slony to detect that. Therefore, you have to run a script to deploy changes inside
the cluster to make it work.
MASTERDB=db1
SLAVEDB=db2
HOST1=localhost
HOST2=localhost
DBUSER=hs
slonik<<_EOF_
cluster name = first_cluster;
execute script (
filename = '/tmp/deploy_ddl.sql',
event node = 1
);
_EOF_
The key to our success is execute script. We simply pass an SQL file to the call
and tell it to consult node 1. The content of the SQL file can be quite simple—it
should simply list the DDLs we want to execute:
CREATE TABLE t_second (id int4, name text);
The table will be deployed on both the nodes. The following listing shows that
the table has also made it to the second node, which proves that things have been
working as expected:
db2=# \d t_second
Table "public.t_second"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
name | text |
Of course, you can also create new tables without using Slony, but this is not recommended. Administrators should be aware that all DDLs should go through Slony, and the SQL deployment process has to reflect that; this is a major shortcoming of Slony. For example, adding columns to a table without Slony being aware of it will definitely end in disaster.
MASTERDB=db1
SLAVEDB=db2
HOST1=localhost
HOST2=localhost
DBUSER=hs
slonik<<_EOF_
cluster name = first_cluster;
The key to our success is the merge call at the end of the script. It will make sure
that the new tables become integrated into the existing table set. To make sure
that nothing goes wrong, it makes sense to check out the Slony metadata tables
(sl_table) and see which IDs are actually available.
We have created the table without a primary key. This is highly important—there is
no way for Slony to replicate a table without a primary key. So we have to add this
primary key. Basically, we have two choices of how to do this. The desired way here
is definitely to use execute script, just as we have shown before. If your system is
idling, you can also do it the quick and dirty way:
db1=# ALTER TABLE t_second ADD PRIMARY KEY (id);
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "t_
second_pkey" for table "t_second"
ALTER TABLE
db1=# \q
hs@hs-VirtualBox:~/slony$ psql db2
psql (9.4.1)
Type "help" for help.
However, this is not recommended—it is definitely more desirable to use the Slony
interface to make changes like these. Remember that DDLs are not replicated on
Slony. Therefore, any manual DDL is risky.
Once we have fixed the data structure, we can execute the slonik script again and
see what happens:
hs@hs-VirtualBox:~/slony$ sh slony_add_to_set.sh
<stdin>:6: PGRES_FATAL_ERROR lock table "_first_cluster".sl_event_lock,
"_first_cluster".sl_config_lock;select "_first_cluster".storeSet(2, 'a
second replication set'); - ERROR: duplicate key value violates unique
constraint "sl_set-pkey"
DETAIL: Key (set_id)=(2) already exists.
CONTEXT: SQL statement "insert into "_first_cluster".sl_set
(set_id, set_origin, set_comment) values
(p_set_id, v_local_node_id, p_set_comment)"
PL/pgSQL function _first_cluster.storeset(integer,text) line 7 at SQL
statement
What you see is a typical problem that you will face with Slony. If something goes
wrong, it can be really, really hard to get things back in order. This is a scenario
you should definitely be prepared for.
So, to fix the problem, we can simply drop the table set again and start from scratch:
#!/bin/sh
MASTERDB=db1
SLAVEDB=db2
HOST1=localhost
HOST2=localhost
DBUSER=hs
slonik<<_EOF_
To kill a table set, we can run drop set. It will help you get back to where you
started. The script will execute cleanly:
hs@hs-VirtualBox:~/slony$ sh slony_drop_set.sh
MASTERDB=db1
SLAVEDB=db2
HOST1=localhost
HOST2=localhost
DBUSER=hs
slonik<<_EOF_
cluster name = first_cluster;
We can now cleanly execute the script, and everything will be replicated as expected:
hs@hs-VirtualBox:~/slony$ sh slony_add_to_set_v2.sh
<stdin>:11 subscription in progress before mergeSet. waiting
<stdin>:11 subscription in progress before mergeSet. waiting
So, once you have made a change, you should always take a look and check if
everything is working nicely. One simple way to do this is to connect to the db2
database and do this:
db2=# BEGIN;
BEGIN
db2=# DELETE FROM t_second;
ERROR: Slony-I: Table t_second is replicated and cannot be modified on a
subscriber node - role=0
db2=# ROLLBACK;
ROLLBACK
You can start a transaction and try to delete a row. It is supposed to fail. If it
does not, you can safely roll back and try to fix your problem. As you are using a
transaction that never commits, nothing can go wrong.
Performing failovers
Once you have learned how to replicate tables and add them to sets, it is time to
learn about failover. Basically, we can distinguish between two types of failovers:
• Planned failovers
• Unplanned failovers and crashes
Planned failovers
Having planned failovers might be more of a luxury scenario. In some cases,
you will not be so lucky and you will have to rely on automatic failovers or face
unplanned outages.
Basically, a planned failover can be seen as moving a set of tables to some other
node. Once that node is in charge of those tables, you can handle things accordingly.
In our example, we want to move all tables from node 1 to node 2. In addition to this,
we want to drop the first node. Here is the code required:
#!/bin/sh
MASTERDB=db1
SLAVEDB=db2
HOST1=localhost
HOST2=localhost
DBUSER=hs
slonik<<_EOF_
cluster name = first_cluster;
After our standard introduction, we can call move set. The catch here is this:
we have to create a lock to make this work. The reason is that we have to protect
ourselves from changes made to the system when failover is performed. You must
not forget this lock, otherwise you might find yourself in a truly bad situation. Just as
in all of our previous examples, nodes and sets are represented using their numbers.
Once we have moved the set to the new location, we have to wait for the event to be
completed, and then we can drop the node (if desired).
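A sketch of this lock set / move set / drop node sequence, using the node and set numbers from this chapter, might look as follows; as before, the conninfo strings are assumptions matching the running example:

```shell
slonik<<_EOF_
cluster name = first_cluster;
node 1 admin conninfo = 'dbname=db1 host=localhost user=hs';
node 2 admin conninfo = 'dbname=db2 host=localhost user=hs';

# protect the set against changes while it is being handed over
lock set (id = 1, origin = 1);

# node 2 becomes the new origin of set 1
move set (id = 1, old origin = 1, new origin = 2);

# wait until node 2 has confirmed the change, then retire node 1
wait for event (origin = 1, confirmed = 2, wait on = 1);
drop node (id = 1, event node = 2);
_EOF_
```

Without the lock set call, writes hitting node 1 during the handover could be lost, which is exactly the situation the lock protects against.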
Once we have failed over to the second node, we can delete data. Slony has removed
the triggers that prevent this operation:
db2=# DELETE FROM t_second;
DELETE 1
The same has happened to the table on the first node. There are no more triggers,
but the table itself is still in place:
db1=# \d t_second
Table "public.t_second"
Column | Type | Modifiers
--------+---------+-----------
id | integer | not null
name | text |
Indexes:
"t_second_pkey" PRIMARY KEY, btree (id)
You can now take the node offline and use it for other purposes.
Unplanned failovers
In the event of an unplanned failover, you may not be so lucky. An unplanned failover
could be a power outage, a hardware failure, or some site failure. Whatever it is, there
is no need to be afraid. You can still bring the cluster back to a reasonable state easily.
A short slonik script executed on one of the remaining nodes is all that is needed to perform a failover from one node to another and remove the dead node from the cluster. It is a safe and reliable procedure.
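The slonik commands in question are failover and drop node. The sketch below assumes that node 1 has died and node 2 takes over; the conninfo strings mirror the examples used throughout this chapter:

```shell
slonik<<_EOF_
cluster name = first_cluster;
node 1 admin conninfo = 'dbname=db1 host=localhost user=hs';
node 2 admin conninfo = 'dbname=db2 host=localhost user=hs';

# promote node 2 and redirect all sets originating on node 1
failover (id = 1, backup node = 2);

# remove the dead node from the cluster configuration
drop node (id = 1, event node = 2);
_EOF_
```

Unlike move set, failover does not need the dead node to be reachable, which is precisely why it exists.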
Summary
Slony is a widespread tool used to replicate PostgreSQL databases at a logical level.
In contrast to transaction log shipping, it can be used to replicate instances between
various versions of PostgreSQL, and there is no need to replicate an entire instance.
It allows us to flexibly replicate individual objects.
In the next chapter, we will focus our attention on SkyTools, a viable alternative to
Slony. We will cover installation, generic queues, and replication.
Using SkyTools
Having introduced Slony, we will take a look at another popular replication tool.
SkyTools is a software package originally developed by Skype, and it serves a
variety of purposes. It is not a single program but a collection of tools and services
that you can use to enhance your replication setup. It offers solutions for generic
queues, a simplified replication setup, data transport jobs, as well as a programming
framework suitable for database applications (especially transport jobs). Covering
all aspects of SkyTools is not possible here. However, in this chapter, the most
important features, concepts, and tools are covered to show you an easy way to
enter this vast world of tools and ideas and benefit as much as possible.
As with all the chapters of this book, the most recent version of SkyTools will be used.
Installing SkyTools
SkyTools is an open source package and can be downloaded freely from pgfoundry.org. For the purposes of this chapter, we have used version 3.2, which can be found at http://pgfoundry.org/frs/download.php/3622/skytools-3.2.tar.gz.
To install the software, we first need to extract the tarball and run configure. The important thing here is that we have to tell configure where to find pg_config; this ensures that SkyTools knows how to compile the code and where to look for libraries.
Once this has been executed successfully, we can run make and make install (which
might have to be run as root if PostgreSQL has been installed as the root user):
./configure \
--with-pgconfig=/usr/local/pgsql/bin/pg_config
make
sudo make install
Once the code has been compiled, the system is ready for action.
Dissecting SkyTools
SkyTools is not just a single script but a collection of various tools serving
different purposes. Once we have installed SkyTools, it makes sense to inspect
these components in more detail:
• pgq: A generic queuing interface used to flexibly dispatch and distribute data
• londiste: An easy-to-use tool used to replicate individual tables and entire
databases on a logical level
• walmgr: A toolkit used to manage transaction log shipping
Chapter 11
The question is: what is the point of a queue in general? A queue has some very
nice features. First of all, it will guarantee the delivery of a message. In addition
to that, it will make sure that the order in which the messages are put into it is
preserved. This is highly important in the case of replication because we must
make sure that the messages will not overtake each other.
The idea behind a queue is to be able to send anything from an entity producing the
data to any other host participating in the system. This is suitable for replication and
for a lot more. You can use pgq as an infrastructure to flexibly dispatch information.
Practical examples for this can be shopping cart purchases, bank transfers, or user
messages. In this sense, replicating an entire table is more or less a special case. A
queue is really a generic framework used to move data around.
Running pgq
To use pgq inside a database, you have to install it as a normal PostgreSQL extension.
If the installation process has worked properly, you can simply run the following
instruction:
test=# CREATE EXTENSION pgq;
CREATE EXTENSION
Now that all the modules have been loaded into the database, we can create a
simple queue:
test=# SELECT pgq.create_queue('DemoQueue');
create_queue
--------------
1
(1 row)
If the queue has been created successfully, a number will be returned. Internally,
the queue is just an entry inside some pgq bookkeeping table:
test=# \x
Expanded display is on.
test=# TABLE pgq.queue;
-[ RECORD 1 ]------------+------------------------------
queue_id | 1
queue_name | DemoQueue
queue_ntables | 3
queue_cur_table | 0
queue_rotation_period | 02:00:00
queue_switch_step1 | 1892
queue_switch_step2 | 1892
queue_switch_time | 2015-02-24 12:06:42.656252+01
queue_external_ticker | f
queue_disable_insert | f
queue_ticker_paused | f
queue_ticker_max_count | 500
queue_ticker_max_lag | 00:00:03
queue_ticker_idle_period | 00:01:00
queue_per_tx_limit |
queue_data_pfx | pgq.event_1
queue_event_seq | pgq.event_1_id_seq
queue_tick_seq | pgq.event_1_tick_seq
The bookkeeping table outlines some essential information about our queue
internals. In this specific example, it will tell us how many internal tables pgq
will use to handle our queue, which table is active at the moment, how often it
is switched, and so on. In practice, this information is not relevant to ordinary
users; it is merely an internal detail.
Once the queue has been created, we can add data to the queue. The function used
to do this has three parameters. The first parameter is the name of the queue. The
second and third parameters are data values to enqueue. In many cases, it makes a
lot of sense to use two values here. The first value can nicely represent a key, while
the second value can be seen as the payload of this message. Here is an example:
test=# BEGIN;
BEGIN
test=# SELECT pgq.insert_event('DemoQueue',
'some_key_1', 'some_data_1');
insert_event
--------------
1
(1 row)
test=# SELECT pgq.insert_event('DemoQueue',
'some_key_2', 'some_data_2');
insert_event
--------------
2
(1 row)
test=# COMMIT;
COMMIT
Adding consumers
In our case, we have added two rows featuring some sample data. Now we can
register two consumers, which are supposed to get those messages in the proper order:
test=# BEGIN;
BEGIN
test=# SELECT pgq.register_consumer('DemoQueue',
'Consume_1');
register_consumer
-------------------
1
(1 row)
test=# SELECT pgq.register_consumer('DemoQueue',
'Consume_2');
register_consumer
-------------------
1
(1 row)
test=# COMMIT;
COMMIT
Two consumers have been created. A message will be marked as processed as soon
as both the consumers have fetched it and marked it as done.
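The per-consumer bookkeeping described above can be illustrated with a small Python model. This is a toy sketch, not pgq's internals; the class and method names are made up for illustration only:

```python
# Toy model of per-consumer delivery: every registered consumer sees each
# event exactly once, in order, independently of the other consumers.
# (Illustration only -- not how pgq is implemented.)

class ToyQueue:
    def __init__(self):
        self.events = []     # ordered event payloads
        self.positions = {}  # consumer name -> index of next unseen event

    def register_consumer(self, name):
        self.positions[name] = 0

    def insert_event(self, payload):
        self.events.append(payload)

    def next_events(self, name):
        """Hand out everything this consumer has not seen yet, in order."""
        pos = self.positions[name]
        batch = self.events[pos:]
        self.positions[name] = len(self.events)
        return batch

q = ToyQueue()
q.register_consumer("Consume_1")
q.register_consumer("Consume_2")
q.insert_event("some_data_1")
q.insert_event("some_data_2")

print(q.next_events("Consume_1"))  # ['some_data_1', 'some_data_2']
print(q.next_events("Consume_1"))  # [] -- already delivered to Consume_1
print(q.next_events("Consume_2"))  # ['some_data_1', 'some_data_2'] -- still visible here
```

Note how consuming an event only advances the position of one consumer; the event stays available to every other registered consumer.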
Connection 1 Connection 2
INSERT ... VALUES (1) BEGIN;
INSERT ... VALUES (2) BEGIN;
INSERT ... VALUES (3) COMMIT;
INSERT ... VALUES (4) COMMIT;
Remember that when we manage queues, we have to make sure that the total order is
maintained: row number 3 may be given to the consumer only once the transaction
writing row number 4 has been committed. If we gave row number 3 to the consumer
before the second transaction in connection 1 had finished, row number 3 would
effectively have overtaken row number 2. This must not happen.
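This ordering rule can be sketched in a few lines of Python. It is a simplified model of the invariant, not of pgq's actual implementation:

```python
# Simplified model of pgq's ordering rule: an event may be handed out only
# when every event enqueued before it has been committed as well, so later
# rows can never overtake earlier, still-uncommitted ones.

def consumable_prefix(events):
    """Return the row IDs that may be handed to a consumer.

    `events` is a list of (row_id, committed) tuples in insertion order.
    Delivery stops at the first uncommitted event.
    """
    prefix = []
    for row_id, committed in events:
        if not committed:
            break  # everything after this point has to wait
        prefix.append(row_id)
    return prefix

# Row 2's transaction is still open, so rows 3 and 4 must wait:
print(consumable_prefix([(1, True), (2, False), (3, True), (4, True)]))  # [1]
# Once everything has committed, the full order is released:
print(consumable_prefix([(1, True), (2, True), (3, True), (4, True)]))   # [1, 2, 3, 4]
```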
In the case of pgq, a so-called ticker process will take care of those little details.
The ticker (pgqd) process will handle the queue for us and decide what can be
consumed. To make this process work, we create two directories. One will hold
log files and the other will store the pid files created by the ticker process:
hs@hs-VirtualBox:~$ mkdir log
hs@hs-VirtualBox:~$ mkdir pid
Then we can create a configuration file for the ticker (pgqd.ini). It starts with
the [pgqd] section, pointing to the log and pid files we have just prepared:
[pgqd]
logfile = /home/hs/log/pgqd.log
pidfile = /home/hs/pid/pgqd.pid
## optional parameters ##
# libpq connect string without dbname=
base_connstr = host=localhost
## optional timeouts ##
# how often to check for new databases
check_period = 60
maint_period = 120
As we have already mentioned, the ticker is in charge of those queues. To make sure
that this works nicely, we have to point the ticker to the PostgreSQL instance. Keep
in mind that the connect string will be autocompleted (some information is already
known by the infrastructure and is used for autocompletion). Ideally, you will
use the database_list directive here to make sure that only the databases that
are really needed are handled.
As far as logging is concerned, you have two options here. You can either directly
log to syslog or send the log to a log file. In our example, we have decided not to
use syslog (syslog has been set to 0 in our configuration file). Finally, there
are some parameters used to configure how often queue maintenance should be
performed, and so on.
The important thing here is that the ticker can also be started directly as a daemon.
The -d command-line option will automatically send the process to the background
and decouple it from the active terminal.
Consuming messages
Just adding messages to the queue might not be what we want. At some point,
we would also want to consume this data. To do so, we can call pgq.next_batch.
The system will return a number identifying the batch:
test=# BEGIN;
BEGIN
test=# SELECT pgq.next_batch('DemoQueue', 'Consume_1');
next_batch
------------
1
(1 row)
Once we have got the ID of the batch, we can fetch the data:
test=# \x
Expanded display is on.
test=# SELECT * FROM pgq.get_batch_events(1);
-[ RECORD 1 ]----------------------------
ev_id | 1
ev_time | 2015-02-24 12:15:39.854199+02
ev_txid | 489695
ev_retry |
ev_type | some_key_1
ev_data | some_data_1
ev_extra1 |
ev_extra2 |
ev_extra3 |
ev_extra4 |
-[ RECORD 2 ]----------------------------
ev_id | 2
ev_time | 2015-02-24 12:15:39.854199+02
ev_txid | 489695
ev_retry |
ev_type | some_key_2
ev_data | some_data_2
ev_extra1 |
ev_extra2 |
ev_extra3 |
ev_extra4 |
test=# COMMIT;
COMMIT
In our case, the batch consists of two messages. It is important to know that messages
that have been enqueued in separate transactions or by many different connections
might still end up in the same batch for the consumer. This behavior is totally
intended. The correct order will be preserved.
Once a batch has been processed by the consumer, it can be marked as done:
test=# SELECT pgq.finish_batch(1);
finish_batch
--------------
1
(1 row)
This means that the data has left the queue. Logically, pgq.get_batch_events will
return an error for this batch ID if you try to dequeue it again:
test=# SELECT * FROM pgq.get_batch_events(1);
ERROR: batch not found
CONTEXT: PL/pgSQL function pgq.get_batch_events(bigint) line 16 at
assignment
Note that the message is gone only for this consumer. Every other
consumer will still be able to consume it exactly once.
Using a cursor here (see pgq.get_batch_cursor) allows us to dequeue arbitrary
amounts of data from a queue without any problems and without memory-related issues.
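The benefit of cursor-based fetching can be shown with a small Python sketch. The function below is a hypothetical stand-in and does not mirror the real pgq.get_batch_cursor() interface:

```python
# Sketch of why a cursor matters: instead of materializing a huge batch in
# memory at once, fetch it in fixed-size chunks so memory use stays bounded.
# (Hypothetical stand-in for cursor-based fetching, not pgq's real API.)

def fetch_in_chunks(batch, chunk_size):
    """Yield the batch in chunks of at most `chunk_size` events."""
    for start in range(0, len(batch), chunk_size):
        yield batch[start:start + chunk_size]

batch = [f"event_{i}" for i in range(10)]
chunks = list(fetch_in_chunks(batch, 3))
print([len(c) for c in chunks])  # [3, 3, 3, 1] -- bounded memory per chunk
```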
Dropping queues
If a queue is no longer needed, it can be dropped. However, you cannot simply
call pgq.drop_queue. Dropping the queue is possible only if all consumers are
unregistered. Otherwise, the system will error out:
test=# SELECT pgq.drop_queue('DemoQueue');
ERROR: cannot drop queue, consumers still attached
CONTEXT: PL/pgSQL function pgq.drop_queue(text) line 10 at RETURN
test=# SELECT pgq.unregister_consumer('DemoQueue', 'Consume_1');
unregister_consumer
---------------------
1
(1 row)
It is important to see that pgq is not just something that is purely related to replication.
It has a much wider range and offers a solid technology base for countless applications.
The main advantage of Londiste over Slony is that in the case of Londiste replication,
there will be one process per "route." So, if you replicate from A to B, this channel
will be managed by one Londiste process. If you replicate from B to A or A to C,
these will be separate processes, and they will be totally independent of each other.
All channels from A to somewhere might share a queue on the consumer, but
the transport processes themselves will not interact. There is some beauty in this
approach because if one component fails, it is unlikely to cause additional problems.
This is not the case if all the processes interact, as they do in the case of Slony.
In my opinion this is one of the key advantages of Londiste over Slony.
In Chapter 10, Configuring Slony, we saw that DDLs are not replicated. The same
rules apply to Londiste because both the systems are facing the same limitations
on the PostgreSQL side.
Before we dig into the details, let's sum up the steps needed to replicate our tables:
1. Write an INI file for the master and initialize the root node.
2. Write an INI file for the slave and initialize the leaf node.
3. Start the Londiste workers on both sides.
4. Start the pgqd ticker process.
5. Add the tables that should be replicated.
Let's get started with the first part of the process. We have to create an init file
that is supposed to control the master:
[londiste3]
job_name = first_table
db = dbname=node1
queue_name = replication_queue
logfile = /var/log/londiste.log
pidfile = /var/pid/londiste.pid
The important part here is that every job must have a name. This makes sense so
that we can distinguish between the processes easily. Then we have to define a
connect string to the master database as well as the name of the replication
queue involved. Finally, we can configure a PID and a log file.
To install the necessary infrastructure and initialize the master node, we can call londiste3:
hs@hs-VirtualBox:~/skytools$ londiste3 londiste3.ini create-root node1
dbname=node1
2015-02-25 13:37:24,902 4236 WARNING No host= in public connect string,
bad idea
2015-02-25 13:37:25,118 4236 INFO plpgsql is installed
2015-02-25 13:37:25,119 4236 INFO Installing pgq
2015-02-25 13:37:25,119 4236 INFO Reading from /usr/local/share/
skytools3/pgq.sql
2015-02-25 13:37:25,327 4236 INFO pgq.get_batch_cursor is installed
2015-02-25 13:37:25,328 4236 INFO Installing pgq_ext
2015-02-25 13:37:25,328 4236 INFO Reading from /usr/local/share/
skytools3/pgq_ext.sql
2015-02-25 13:37:25,400 4236 INFO Installing pgq_node
2015-02-25 13:37:25,400 4236 INFO Reading from /usr/local/share/
skytools3/pgq_node.sql
2015-02-25 13:37:25,471 4236 INFO Installing londiste
2015-02-25 13:37:25,471 4236 INFO Reading from /usr/local/share/
skytools3/londiste.sql
2015-02-25 13:37:25,579 4236 INFO londiste.global_add_table is installed
2015-02-25 13:37:25,670 4236 INFO Initializing node
2015-02-25 13:37:25,674 4236 INFO Location registered
2015-02-25 13:37:25,755 4236 INFO Node "node1" initialized for queue
"replication_queue" with type "root"
2015-02-25 13:37:25,761 3999 INFO Done
In SkyTools, there is a very simple rule: the first parameter passed to the script is
always the INI file containing the desired configuration. Then comes an instruction,
as well as some parameters. The call will install all of the necessary infrastructure
and return Done.
After firing up the worker on the master, we can take a look at the slave
configuration:
[londiste3]
job_name = first_table_slave
db = dbname=node2
queue_name = replication_queue
logfile = /var/log/londiste_slave.log
pidfile = /var/pid/londiste_slave.pid
The main difference here is that we use a different connect string and a different
name for the job. If the master and slave are two separate machines, the rest can
stay the same.
Once we have compiled the configuration, we can create the leaf node:
hs@hs-VirtualBox:~/skytools$ londiste3 slave.ini create-leaf node2
dbname=node2 --provider=dbname=node1
2015-02-25 13:51:27,090 4246 WARNING No host= in public connect string,
bad idea
2015-02-25 13:51:27,117 4246 INFO plpgsql is installed
2015-02-25 13:51:27,118 4246 INFO pgq is installed
2015-02-25 13:51:27,122 4246 INFO pgq.get_batch_cursor is installed
2015-02-25 13:51:27,122 4246 INFO pgq_ext is installed
2015-02-25 13:51:27,123 4246 INFO pgq_node is installed
2015-02-25 13:51:27,124 4246 INFO londiste is installed
2015-02-25 13:51:27,126 4246 INFO londiste.global_add_table is installed
2015-02-25 13:51:27,205 4246 INFO Initializing node
2015-02-25 13:51:27,291 4246 INFO Location registered
2015-02-25 13:51:27,308 4246 INFO Location registered
2015-02-25 13:51:27,317 4246 INFO Subscriber registered: node2
2015-02-25 13:51:27,321 4246 INFO Location registered
2015-02-25 13:51:27,324 4246 INFO Location registered
2015-02-25 13:51:27,334 4246 INFO Node "node2" initialized for queue
"replication_queue" with type "leaf"
2015-02-25 13:51:27,345 4246 INFO Done
The key here is to tell the slave where to find the master (provider). Once the system
knows where to find all of the data, we can fire up the worker here as well:
hs@hs-VirtualBox:~/skytools$ londiste3 slave.ini worker
2015-02-25 13:55:10,764 4301 INFO Consumer uptodate = 1
This should not cause any issues and work nicely if the previous command has
succeeded. Now that we have everything in place, we can start with the final
component of the setup—the ticker process:
[pgqd]
logfile = /home/hs/log/pgqd.log
pidfile = /home/hs/pid/pgqd.pid
The ticker config is pretty simple: all it takes is three lines. These lines are
enough to fire up the ticker process. All of the relevant information is nicely stored
in the .ini file and can, therefore, be used by SkyTools:
hs@hs-VirtualBox:~/skytools$ pgqd pgqd.ini
2015-02-25 14:01:12.181 4683 LOG Starting pgqd 3.2
2015-02-25 14:01:12.188 4683 LOG auto-detecting dbs ...
2015-02-25 14:01:12.310 4683 LOG test: pgq version ok: 3.2
2015-02-25 14:01:12.531 4683 LOG node1: pgq version ok: 3.2
2015-02-25 14:01:12.596 4683 LOG node2: pgq version ok: 3.2
2015-02-25 14:01:42.189 4683 LOG {ticks: 90, maint: 3, retry: 0}
2015-02-25 14:02:12.190 4683 LOG {ticks: 90, maint: 0, retry: 0}
If the ticker has started successfully, it means that we have the entire infrastructure
in place. So far, we have configured all the processes needed for replication, but we
have not yet told the system what to replicate.
The londiste3 command offers a set of subcommands to define exactly that.
In our example, we simply want to add all the tables and replicate them:
hs@hs-VirtualBox:~/skytools$ londiste3 londiste3.ini add-table --all
2015-02-25 14:02:39,367 4760 INFO Table added: public.t_test
Just like Slony, Londiste will install a trigger on the source tables, which keeps
track of all the changes (unlike Slony, it does not install triggers on the slave).
Those changes will be written to a pgq queue and dispatched by the processes we
have just set up:
node1=# \d t_test
Table "public.t_test"
SkyTools and Londiste in particular offer a rich set of additional features to make
your life easy. However, documenting all of those features would unfortunately
exceed the scope of this book. If you want to learn more, we recommend taking a
detailed look at the doc directory inside the SkyTools source code. You will find a
couple of interesting documents.
Setting up streaming replication has become so easy that add-ons such as walmgr
are not as important as they used to be. Of course, this is a subjective observation,
which might not match what you have observed in the recent past.
To make sure that the scope of this book does not explode, we have decided not to
include any details about walmgr in this chapter. For further information, we invite
you to review the documentation directory inside the walmgr source code. It contains
some easy-to-use examples as well as some background information about the
replication technique.
Summary
In this chapter, we discussed SkyTools, a tool package provided and developed
by Skype. It provides you with generic queues (pgq) as well as a replicator called
Londiste. In addition to these, SkyTools provides walmgr, which can be used to
handle WAL files.
The next chapter will focus on Postgres-XC, a solution capable of scaling reads as
well as writes. It provides users with a consistent view of data and automatically
dispatches queries inside the cluster.
Working with Postgres-XC
In this chapter, we will focus on a write-scalable, multimaster,
synchronous, symmetric, and transparent replication solution for PostgreSQL, called
PostgreSQL eXtensible Cluster (Postgres-XC). The goal of this project is to provide the
end user with a transparent replication solution, which allows higher levels of loads
by horizontally scaling to multiple servers.
In an array of servers running Postgres-XC, you can connect to any node inside the
cluster. The system will make sure that you get exactly the same view of the data
on every node. This is really important, as it solves a handful of problems on the
client side. There is no need to add logic to applications that write to just one node.
You can balance your load easily. Data is always instantly visible on all nodes after
a transaction commits. Postgres-XC is wonderfully transparent, and application
changes are not required for scaling out.
The most important thing to keep in mind when considering Postgres-XC is that it is
not an add-on to PostgreSQL; it is a code fork. Some years ago, the code was forked
because adding massive functionality, as Postgres-XC does, is not easy in the main
code base of PostgreSQL. Postgres-XC does not use Vanilla PostgreSQL version
numbers, and the code base usually lags behind the official PostgreSQL source tree.
Note that there is also a second project out there, called Postgres-XL. It shares many
similarities with Postgres-XC, but it is not the same thing and represents one more
code fork developed by a different group of people.
This chapter will provide you with information about Postgres-XC. We will cover
the following topics in this chapter:
• Optimizing storage
• Performance management
• Adding and dropping nodes
• Data nodes
• A Global Transaction Manager (GTM)
• A coordinator
• A GTM proxy
Data nodes
A data node is the actual storage backbone of the system. It will hold all or a fraction
of the data inside the cluster. It is connected to the Postgres-XC infrastructure and
will handle the local SQL execution.
Chapter 12
GTM
The GTM provides the cluster with a consistent view of the data. A consistent
view is necessary because otherwise, it would be impossible to load-balance
in an environment that is totally transparent to the application.
Besides this core functionality, the GTM also handles global values for sequences
and global transaction IDs.
Coordinators
A Coordinator is a piece of software that serves as an entry point for our applications.
An application will connect to one of the available Coordinators, which will be in charge
of SQL parsing, analysis, creation of the global execution plan, and global SQL execution.
GTM proxy
The GTM proxy can be used to improve performance. Given the Postgres-XC
architecture, each transaction has to issue a request to the GTM. In many cases, this
can lead to latency and, subsequently, to performance issues. The GTM proxy steps in,
collects requests to the GTM into blocks, and sends them on together.
One advantage here is that connections can be cached to avoid a great deal of
overhead caused by their opening and closing all the time.
Installing Postgres-XC
Postgres-XC can be downloaded from http://sourceforge.net/projects/
postgres-xc/files/Version_1.2/. For this book, we have used version 1.2.1 of
Postgres-XC, which is based on PostgreSQL 9.3.2.
To compile the code, we have to extract the code using the following command:
tar xvfz pgxc-v1.2.1.tar.gz
Then we can compile the code just like standard PostgreSQL using the following
commands:
cd postgres-xc-1.2.1
./configure --prefix=/usr/local/postgres-xc
make
make install
Once this has been executed, we can move ahead and configure the cluster.
Keep in mind that, to make life simple, we will set up the entire cluster on a
single server. In production, you would logically use different nodes for different
components. It is not very common to use many XC nodes on the same box to
speed things up.
First of all, we have to prepare a directory for the GTM using initgtm (the -Z option
tells it which kind of node to initialize):
[hs@localhost data]$ initgtm -D /data/gtm -Z gtm
At the end of its output, initgtm suggests two ways of starting the GTM:
gtm -D /data/gtm
or
gtm_ctl -Z gtm -D /data/gtm -l logfile start
Don't expect anything great and magical from initgtm. It merely creates some
basic configuration needed for handling the GTM. It does not create a large
database infrastructure there.
However, it already gives us a clue on how to start the GTM, which will be done
later on in the process.
Then we have to initialize those four database nodes that we want to run. To do
so, we have to run initdb, just as for any Vanilla PostgreSQL database instance.
However, in the case of Postgres-XC, we have to tell initdb what name the node
will have. In our case, we will create the first node called node1 in the node1
directory. Each node will need a dedicated name. This is done as follows:
[hs@localhost data]$ initdb -D /data/node1/ --nodename=node1
The files belonging to this database system will be owned by user "hs".
… *snip* ...
Success.
You can now start the database server of the Postgres-XC coordinator
using:
postgres --coordinator -D /data/node1/
or
pg_ctl start -D /data/node1/ -Z coordinator -l logfile
You can now start the database server of the Postgres-XC datanode using:
postgres --datanode -D /data/node1/
or
pg_ctl start -D /data/node1/ -Z datanode -l logfile
We can call initdb for all four instances that we will run. To make sure
that those instances can coexist on the very same test box, we have
to change the port for each of those instances. In our example, we will
simply use the following ports: 5432, 5433, 5434, and 5435.
To change the port, just edit the port setting in the postgresql.conf file of each
instance. Also, make sure that each instance uses a different socket directory (the
unix_socket_directories setting); otherwise, you cannot start more than one instance.
Now that all the instances have been initialized, we can start the GTM. This works
as follows:
[hs@localhost ~]$ gtm_ctl -D /data/gtm/ -Z gtm start
server starting
To see whether it works, we can check for the process, like this:
[hs@localhost ~]$ ps ax | grep gtm
32152 pts/1 S 0:00 /usr/local/postgres-xc/bin/gtm -D /data/gtm
You can also use pgrep instead of a normal grep. Then we can start all of those
nodes one after the other.
In our case, we will use one of those four nodes as the Coordinator. The Coordinator
will be using port 5432. To start it, we can call pg_ctl and tell the system to use this
node as a Coordinator:
pg_ctl -D /data/node1/ -Z coordinator start
The remaining nodes will simply act as data nodes. We can easily define the role of
a node on startup:
pg_ctl -D /data/node2/ -Z datanode start
pg_ctl -D /data/node3/ -Z datanode start
pg_ctl -D /data/node4/ -Z datanode start
Once this has been done, we can check whether those nodes are up
and running.
We are almost done now, but before we can get started, we have to familiarize those
nodes with each other. Otherwise, we will not be able to run queries or commands
inside the cluster. If the nodes don't know each other, an error will show up:
[hs@localhost ~]$ createdb test -h localhost -p 5432
createdb: database creation failed: ERROR: No Datanode defined in
cluster
HINT: You need to define at least 1 Datanode with CREATE NODE.
To tell the system about the locations of the nodes, we connect to the Coordinator
and register each data node with CREATE NODE (adjust host and port to your setup),
and finally reload the connection pool:
[hs@localhost ~]$ psql postgres -h localhost -p 5432
psql (PGXC , based on PG 9.3.2)
Type "help" for help.
postgres=# CREATE NODE node2 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 5433);
CREATE NODE
postgres=# CREATE NODE node3 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 5434);
CREATE NODE
postgres=# CREATE NODE node4 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 5435);
CREATE NODE
postgres=# SELECT pgxc_pool_reload();
pgxc_pool_reload
------------------
t
(1 row)
Once those nodes are familiar with each other, we can connect to the Coordinator
and execute whatever we want. In our example, we will simply create a database:
[hs@localhost ~]$ psql postgres -p 5432 -h localhost
psql (PGXC , based on PG 9.3.2)
Type "help" for help.
Sure, you can just load data and things will work, but if performance is really an
issue, you should try to think of how you can use your data. Keep in mind there
is no point in using a distributed database system if your database system is not
heavily loaded. So, if you are a user of Postgres-XC, we expect your load and your
requirements to be high.
The DISTRIBUTE BY clause allows you to specify where to store a table. If you want
to tell Postgres-XC that a table has to be on every node in the cluster, we recommend
using REPLICATION. This is especially useful if you are creating a small lookup table
or a table that is frequently used in many queries.
If the goal is to scale out, it is recommended to split a table into a list of nodes. But
why would anybody want to split the table? The reason is quite simple: if
you give full replicas of a table to all the data nodes, every row written means
one write per node. Clearly, this is no more scalable than a single node,
because each node has to take the entire write load. For large tables facing heavy writes,
it can therefore be beneficial to split the table into various nodes. Postgres-XC
offers various ways to do this.
The ROUNDROBIN strategy will just spread the data across the nodes in turn, HASH
will dispatch data based on a hash key, and MODULO will simply distribute data
evenly, given a numeric key.
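The three strategies can be sketched in Python. This is an illustration only: Postgres-XC performs the distribution internally, and crc32 here is just a stand-in for its real hash function:

```python
# Client-side sketch of the three distribution strategies.
# (Postgres-XC does this inside the cluster; crc32 is a stand-in hash.)
import itertools
import zlib

NODES = ["node2", "node3", "node4"]

def modulo_node(key):
    """MODULO: distribute evenly by a numeric key."""
    return NODES[key % len(NODES)]

def hash_node(key):
    """HASH: dispatch based on a hash of the key."""
    return NODES[zlib.crc32(str(key).encode()) % len(NODES)]

round_robin = itertools.cycle(NODES)  # ROUNDROBIN: assign rows in turn

print([modulo_node(k) for k in range(6)])
# ['node2', 'node3', 'node4', 'node2', 'node3', 'node4']
print(next(round_robin), next(round_robin))  # node2 node3
```

Note that MODULO and HASH are deterministic: the same key always lands on the same node, which is exactly what makes key-based lookups and co-located joins possible.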
Keep in mind that a node group is static; you cannot add nodes to it later on. So if
you start organizing your cluster, you have to think beforehand which areas your
cluster will have.
In addition to that, it is pretty hard to reorganize the data once it has been
dispatched. If a table is spread to, say, four nodes, you cannot easily add a fifth node
to handle the table. First of all, adding a fifth node will require a rebalance. Secondly,
most of those features are still under construction and not yet fully available for end
users. At this point, Postgres-XC has no satisfactory means of growing large clusters
and shrinking them.
Optimizing joins
Dispatching your data cleverly is essential if you want to join it. Let's assume a
simple scenario consisting of three tables:
Let's assume that we have to join this data frequently. In this scenario, it is highly
recommended to partition t_person and t_person_payment by the same join key.
Doing that will allow Postgres-XC to join and merge a lot of stuff locally on the
data nodes, instead of having to ship data around within the cluster. Of course,
we can also create full replicas of the t_person table if this table is read so often
that it makes sense to do so.
When coming up with proper partitioning logic, we just want to remind you of a
simple rule: Try to perform calculations locally; that is, try to avoid moving data
around at any cost.
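The effect of partitioning both tables by the same join key can be sketched as follows. This is a toy model; the placement function is an assumption for illustration, not Postgres-XC's actual one:

```python
# Why co-partitioning enables local joins: if t_person and t_person_payment
# are distributed by the same key with the same placement function, matching
# rows always land on the same node, so no rows need to be shipped between
# nodes to compute the join. (Sketch only.)

def node_for(key, n_nodes=3):
    """Shared placement function used for both tables."""
    return key % n_nodes

persons  = [(1, "Alice"), (2, "Bob"), (3, "Carol")]   # (person_id, name)
payments = [(1, 100), (2, 250), (3, 75)]              # (person_id, amount)

# Every matching person/payment pair maps to the same node:
local_join_possible = all(
    node_for(p[0]) == node_for(pay[0])
    for p in persons for pay in payments if p[0] == pay[0]
)
print(local_join_possible)  # True
```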
We also recommend fully replicating fairly small lookup tables so that as much
work as possible can be performed on those data nodes. What does small mean in
this context? Let's imagine that you are storing information about millions of people
around the world. You might want to split data across many nodes. However, it will
clearly not make sense if you split the list of potential countries on different nodes.
The number of countries on this planet is limited, so it is simply more viable to have
a copy of this data on all the nodes.
To work around the latency caused by every transaction having to contact the GTM,
we can introduce a GTM proxy. The idea is that transaction IDs will be requested in
larger blocks. The core idea is that we want to avoid network traffic and, especially,
latency. The concept is pretty similar to the way group commit works in PostgreSQL.
How can a simple GTM proxy be set up? First of all, we have to create a directory
where the config is supposed to exist. Then we can make the following call:
initgtm -D /path_to_gtm_proxy/ -Z gtm_proxy
This will create a config sample, which we can adapt easily. After defining a node
name, we should set gtm_host and gtm_port to point to the active GTM. Then we
can tweak the number of worker threads to a reasonable value to make sure that we
can handle more load. Usually, we configure the GTM proxy in such a way that the
number of worker threads matches the number of nodes in the system. This has
proven to be a robust configuration.
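The latency benefit of grouping requests can be sketched with simple arithmetic; the request count and block size below are illustrative assumptions, not measured values:

```python
# Sketch of the GTM proxy idea: instead of one network round trip per
# transaction-ID request, the proxy forwards requests in blocks,
# cutting the number of round trips (and thus accumulated latency).

def round_trips(n_requests, block_size):
    """Round trips to the GTM when requests are grouped into blocks."""
    return -(-n_requests // block_size)  # ceiling division

direct  = round_trips(1000, 1)   # every transaction talks to the GTM itself
proxied = round_trips(1000, 50)  # proxy forwards blocks of 50 requests
print(direct, proxied)  # 1000 20
```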
Once the table has been created, we can add data to it. After completion, we can
check whether the data has been written correctly to the cluster:
test=# SELECT count(*) FROM t_test;
count
-------
1000
(1 row)
The interesting thing here is to see how the data is returned by the database engine.
Let's take a look at the execution plan of our query:
test=# explain (VERBOSE TRUE, ANALYZE TRUE,
NODES true, NUM_NODES true)
SELECT count(*) FROM t_test;
QUERY PLAN
-------------------------------------------------------
Aggregate (cost=2.50..2.51 rows=1 width=0)
(actual time=5.967..5.970 rows=1 loops=1)
Output: pg_catalog.count(*)
-> Materialize (cost=0.00..0.00 rows=0 width=0)
(actual time=5.840..5.940 rows=3 loops=1)
Output: (count(*))
-> Data Node Scan (primary node count=0,
node count=3) on
"__REMOTE_GROUP_QUERY__"
(cost=0.00..0.00 rows=1000 width=0)
(actual time=5.833..5.915 rows=3 loops=1)
Output: count(*)
Node/s: node2, node3, node4
Remote query: SELECT count(*) FROM
(SELECT id FROM ONLY t_test
WHERE true) group_1
Total runtime: 6.033 ms
(9 rows)
PostgreSQL will perform a so-called data node scan. This means that PostgreSQL will
collect data from all the relevant nodes in the cluster. If you look closely, you can see
which query will be pushed down to those nodes inside the cluster. The important
thing here is that the count is already shipped to the remote node. All the counts
coming back from our nodes will then be rolled up into a single count. This kind of
plan is much more complex than a simple local query, but it can be very beneficial as
the amount of data grows. This is because each node will perform only a subset of the
operation. The fact that each node performs only a subset of operations is especially
useful when many things are running in parallel.
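The idea of the pushed-down count can be modeled in a few lines. The per-node row counts are illustrative assumptions; the real merging happens on the Coordinator:

```python
# Sketch of the pushed-down aggregate: each data node computes its own
# count(*), and the Coordinator only adds up the partial results, so the
# heavy scanning happens in parallel on the data nodes. (Model only.)

node_rows = {
    "node2": 334,  # illustrative per-node row counts for t_test
    "node3": 333,
    "node4": 333,
}

partial_counts = list(node_rows.values())  # one small number shipped per node
total = sum(partial_counts)                # merged on the Coordinator
print(total)  # 1000
```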
The Postgres-XC optimizer can push down operations to the data nodes in many
cases, which is good for performance. However, you should still keep an eye on
your execution plans to make sure they are reasonable.
Adding nodes
Postgres-XC allows you to add new servers to the setup at any point in the process.
All you have to do is set up a node as we have seen before and call CREATE NODE
on the controller. The system will then be able to use this node.
However, there is one important point about this: if you have partitioned a table
before adding a new node, this partitioned table will stay in its place. Some people
may expect that Postgres-XC magically rebalances this data to new nodes. User
intervention is necessary for rebalancing. It is your task to move new data there
and make good use of the server.
It is necessary for Postgres-XC to behave in this way because adding a new node
would lead to unexpected behavior otherwise.
Rebalancing data
Since version 1.2, Postgres-XC has offered a new feature: it allows you to
add a server to the infrastructure during normal operations. While adding the
node itself is easy, rebalancing the data is not.
However, there is a catch: Postgres-XC will lock the table, move the data to the
Coordinator, and redispatch it from there. While this is no problem for small
tables, it is a serious issue for large tables, because it will take a while until
the data is nicely resharded and dispatched again.
In addition to this, it can be a capacity issue in the Coordinator part of the
infrastructure. If a multi-TB table is resharded, the storage on the Coordinator
might not be large enough. The main issue is as follows: the more important
sharding is, the more critical resharding really is.
If you decide to use Postgres-XC, it is important to think ahead and start with
enough shards to make things work flawlessly. Otherwise, unexpected downtimes
might occur. For analytical processes, this might be okay, but it is definitely not okay
for large-scale OLTP operations. The main question now is, "How can we figure out
how many nodes are needed?" This is not as easy as it might sound. In many cases,
testing empirically is the best method to avoid oversizing or undersizing.
A related question is what happens when a node fails. The answer is pretty
simple—Postgres-XC will not be able to perform requests that rely on failed nodes.
This can result in a problem for both reads and writes.
A query trying to fetch from a failed node will return an error indicating that no
connection is available. At this point, Postgres-XC is not yet very fault tolerant.
Hopefully, this will change in the future.
Chapter 12
For you as a user, this means that if you are running Postgres-XC, you have to come
up with a proper failover and High Availability (HA) strategy for your system. We
recommend creating replicas of all the nodes to make sure that the controller can
always reach an alternative node if the primary data node fails. Linux HA is a good
option to make nodes fail-safe and achieve fast failovers.
From Postgres-XC 1.2 onwards, nodes can be added and removed during production.
However, keep in mind that resharding might be necessary before this, which is still
a major issue.
If you want to perform this kind of operation, you have to make sure that you are
a superuser. Normal users are not allowed to remove a node from the cluster.
Whenever you drop a node, make sure that it no longer holds any data you might
still need. Removing a node is simply a change inside Postgres-XC's metadata, so the
operation will be quick and the relevant data will simply disappear from your view
of the data. Postgres-XC will not prevent you from shooting yourself in the foot.
One issue is how to actually figure out the location of the data. Postgres-XC has
a set of system tables that allow you to retrieve information about nodes, data
distribution, and so on. The following example shows how a table can be created
and how we can figure out where it is:
test=# CREATE TABLE t_location (id int4)
DISTRIBUTE BY REPLICATION;
CREATE TABLE
test=# SELECT node_name, pcrelid, relname
FROM pgxc_class AS a, pgxc_node AS b,
pg_class AS c
WHERE a.pcrelid = c.oid
In our case, the table has been replicated to all the nodes.
There is one tricky thing you have to keep in mind when dropping nodes: if you drop
a node and recreate it with the same name and connection parameters, it will not be
the same node. When a new node is created, it will get a new object ID. In PostgreSQL, the
name is not as important as the ID of an object. This means that even if you drop a
node accidentally and recreate it using the same name, you will still face problems.
Of course, you can always magically work around it by tweaking the system tables,
but this is not what you should do.
Therefore, we highly recommend being very cautious when dropping nodes from
a production system. It is never really recommended to do this.
To make sure that the GTM cannot be a single point of failure, you can use a GTM
standby. Configuring a GTM standby is not hard to do. All you have to do is create
a GTM config on a spare node and set a handful of parameters in gtm.conf:
startup = STANDBY
active_host = 'somehost.somedomain.com'
active_port = '6666'
synchronous_backup = off
First of all, we have to set the startup parameter to STANDBY. This will tell the
GTM to behave as a slave. Then we have to tell the standby where to find the
main productive GTM. We do so by adding a hostname and a port.
To start the standby, we can use gtm_ctl again. This time, we pass the -Z
gtm_standby option to mark the node as a standby.
Summary
In this chapter, we dealt with Postgres-XC, a distributed version of PostgreSQL
that is capable of horizontal partitioning and query distribution. The goal of the
Postgres-XC project is to have a database solution capable of scaling out writes
transparently. It offers a consistent view of the data and various options for
distributing data inside the cluster.
The next chapter will cover PL/Proxy, a tool used to shard PostgreSQL database
systems. You will learn how to distribute data to various nodes and shard it to
handle large setups.
Scaling with PL/Proxy
Adding a slave here and there is really a nice scalability strategy, which is basically
enough for most modern applications. Many applications will run perfectly well
with just one server; you might want to add a replica to add some security to the
setup, but in many cases, this is pretty much what people need.
If your application grows larger, you can, in many cases, just add slaves and scale
out reading. This too is not a big deal and can be done quite easily. If you want to
add even more slaves, you might have to cascade your replication infrastructure,
but for 98 percent of all applications, this is going to be enough.
In those rare, leftover 2 percent of cases, PL/Proxy can come to the rescue. The idea
of PL/Proxy is to be able to scale out writes. Remember that transaction-log-based
replication can scale out only reads; there is no way to scale out writes.
If you want to scale out writes, turn to PL/Proxy. However, before doing
this, consider fixing your database system by adding missing indexes
and applying other optimizations. Work through the output of
pg_stat_statements and see what can be done to avoid adding dozens of
servers for no reason. This strategy has proven to be effective many times before.
The question now is: how can you ever scale out writes? To do so, we have to
follow an old Roman principle, which has been widely applied in warfare: "Divide
et impera" (in English: "Divide and conquer"). Once you manage to split a problem
into many small problems, you are always on the winning side.
Applying this principle to the database work means that we have to split writes and
spread them to many different servers. The main aim here is to split up data wisely.
As an example, we simply assume that we want to split user data. Let's assume
further that each user has a username to identify themselves.
How can we split data now? At this point, many people would suggest splitting
data alphabetically somehow. Say, everything from A to M goes to server 1, and
the rest to server 2. This is actually a pretty bad idea because we can never assume
that data is evenly distributed. Some names are simply more likely than others, so
if you split by letters, you will never end up with roughly the same amount of data
in each partition (which is highly desirable). However, we definitely want to make
sure that each server has roughly the same amount of data, and find a way to extend
the cluster to more boxes easily. Anyway, we will talk about useful partitioning
functions later on.
We take PL/Proxy and install it on a server that will act as the proxy for our system.
Whenever we run a query, we ask the proxy to provide us with the data. The proxy
will consult its rules and figure out which server the query has to be redirected to.
Basically, PL/Proxy is a way of sharding a database instance.
Chapter 13
The way of asking the proxy for data is by calling a stored procedure.
You have to go through a procedure call in standard PostgreSQL.
So, if you issue a query, PL/Proxy will try to hide a lot of complexity from you and
just provide you with the data, no matter where it comes from.
Of course, there are many ways to split the data. In this section, we will take a look
at a simple, yet useful, way. This way can be applied to many different scenarios.
Let's assume for this example that we want to split data and store it in an array of
16 servers. This is a good number because 16 is a power of 2. In computer science,
powers of 2 are usually good numbers, and the same applies to PL/Proxy.
The key to evenly dividing your data is to first turn your text value into
an integer:
test=# SELECT 'www.postgresql-support.de';
?column?
---------------------------
www.postgresql-support.de
(1 row)
We can use a PostgreSQL built-in function (not related to PL/Proxy) to hash texts.
It will give us an evenly distributed integral number. So, if we hash 1 million rows,
we will see evenly distributed hash keys. This is important because we can split data
into similar chunks.
Now we can take this integer value and keep just the lower 4 binary bits. This will
allow us to address 16 servers:
test=# SELECT hashtext('www.postgresql-support.de')::bit(4);
hashtext
----------
0100
(1 row)
The final four bits are 0100, which are converted back to an integer. This means
that this row is supposed to reside on the fifth node (if we start counting from 0).
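The conversion back to an integer can also be done directly in SQL. Casting the
bit pattern shown above yields the node number:

test=# SELECT B'0100'::int4;
 int4
------
    4
(1 row)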
Using hash keys is by far the simplest way of splitting data. It has some nice
advantages, as in, if you want to increase the size of your cluster, you can easily
add just one bit without having to rebalance the data inside the cluster.
Of course, you can always come up with more complicated and sophisticated rules
to distribute the data.
Setting up PL/Proxy
After this brief theoretical introduction, we can move forward and run some
simple PL/Proxy setups. To do so, we simply install PL/Proxy and see how it
can be utilized.
Installing PL/Proxy is an easy task. First of all, we have to download the source
code from http://pgfoundry.org/frs/?group_id=1000207. Of course, you can
also install binary packages if prebuilt packages are available for your operating
system. However, in this section, we will simply perform an installation from
source and see how things work on a very basic level:
1. The first step in the installation process is to unpack the TAR archive.
This can easily be done using the following command:
tar xvfz plproxy-2.5.tar.gz
2. Once the TAR archive has been unpacked, we can enter the newly
created directory and start the compilation process by simply calling
make && make install.
3. If you want to make sure that your installation is really fine, you can also
run make installcheck. It runs some simple tests to make sure that your
system is operating correctly.
A basic example
To get you started, we want to set up PL/Proxy in such a way that we can fetch
random numbers from all four partitions. This is the most basic example. It
will show all the basic concepts of PL/Proxy.
To enable PL/Proxy, we have to load the extension into the database first:
test=# CREATE EXTENSION plproxy;
CREATE EXTENSION
This will install all of the relevant code and infrastructure you need to make
PL/Proxy work. Then we want to create four databases, which will carry the data we
want to partition:
test=# CREATE DATABASE p0;
CREATE DATABASE
test=# CREATE DATABASE p1;
CREATE DATABASE
test=# CREATE DATABASE p2;
CREATE DATABASE
test=# CREATE DATABASE p3;
CREATE DATABASE
Once we have created these databases, we can run CREATE SERVER. The question is:
what is SERVER? Well, in this context, you can see SERVER as some kind of remote
data source providing you with the data you need. SERVER is always based on a
module (in our case, PL/Proxy) and may carry a handful of options. In the case of
PL/Proxy, these options are just a list of partitions. There can be some additional
parameters as well, but the list of nodes is by far the most important thing here:
CREATE SERVER samplecluster FOREIGN DATA WRAPPER plproxy
OPTIONS ( partition_0 'dbname=p0 host=localhost',
          partition_1 'dbname=p1 host=localhost',
          partition_2 'dbname=p2 host=localhost',
          partition_3 'dbname=p3 host=localhost');
Once we have created the server, we can move ahead and create ourselves a nice
user mapping. The purpose of a user mapping is to tell the system which user we
are going to be on the remote data source. It might very well be the case that we are
user A on the proxy, but user B on the underlying database servers. If you are using
a foreign data wrapper to connect to, say, Oracle, this will be essential. In the case
of PL/Proxy, it is quite often the case that users on the partitions and the proxy
are simply the same.
If we are working as a superuser on the system, this will be enough. If we are not
a superuser, we have to grant permission to the user who is supposed to use our
virtual server. We have to grant the USAGE permission to get this done:
GRANT USAGE ON FOREIGN SERVER samplecluster TO hs;
To see whether our server has been created successfully, we can check out the
pg_foreign_server system table. It holds all of the relevant information about
our virtual server. Whenever you want to figure out which partitions are present,
you can simply consult the system table and inspect srvoptions:
test=# \x
Expanded display is on.
test=# SELECT * FROM pg_foreign_server;
-[ RECORD 1 ]
srvname | samplecluster
srvowner | 10
srvfdw | 16744
srvtype |
srvversion |
srvacl | {hs=U/hs}
srvoptions | {"partition_0=dbname=p0 host=localhost","partition_1=dbname
=p1 host=localhost","partition_2=dbname=p2 host=localhost","partition_3=d
bname=p3 host=localhost"}
The procedure is just like an ordinary stored procedure. The only special thing here
is that it has been written in PL/Proxy. The CLUSTER keyword will tell the system
which cluster to take. In many cases, it can be useful to have more than just one
cluster (maybe if different datasets are present on different sets of servers).
Then we have to define where to run the code. We can run on ANY (any server), ALL (on
all servers), or a specific server. In our example, we have decided to run on all servers.
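The definition of such a function is compact. The following is a sketch matching
the behavior described here (it assumes the samplecluster server created earlier;
the return type is an assumption):

CREATE OR REPLACE FUNCTION get_random()
RETURNS SETOF float8 AS $$
    CLUSTER 'samplecluster';
    RUN ON ALL;
    SELECT random();
$$ LANGUAGE plproxy;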
The most important thing here is that when the procedure is called, we will get one
row per node because we have used RUN ON ALL. In the case of RUN ON ANY, we
would have got only one row because the query would have been executed on any
node inside the cluster:
test=# SELECT * FROM get_random();
get_random
-------------------
0.879995643626899
0.442110917530954
0.215869579929858
0.642985367681831
(4 rows)
To demonstrate how this works, we have to distribute user data to our four databases.
In the first step, we have to create a simple table on all four databases inside the cluster:
p0=# CREATE TABLE t_user (
username text,
password text,
PRIMARY KEY (username)
);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "t_user_
pkey" for table "t_user"
CREATE TABLE
Once we have created the data structure, we can come up with a procedure to actually
dispatch data into this cluster. A simple PL/Proxy procedure will do the job:
CREATE OR REPLACE FUNCTION create_user(name text,
pass text) RETURNS void AS $$
CLUSTER 'samplecluster';
RUN ON hashtext($1);
$$ LANGUAGE plproxy;
The point here is that PL/Proxy will inspect the first input parameter and run a
procedure called create_user on the desired node. RUN ON hashtext($1) will
be our partitioning function. So, the goal here is to find the right node and execute
the same procedure there. The important part is that on the desired node, the
create_user procedure won't be written in PL/Proxy but simply in SQL,
PL/pgSQL, or any other language. The only purpose of the PL/Proxy function is to
find the right node to execute the underlying procedure.
The procedure on each of the nodes that actually puts the data into the table is
pretty simple:
CREATE OR REPLACE FUNCTION create_user(name text,
pass text)
RETURNS void AS $$
INSERT INTO t_user VALUES ($1, $2);
$$ LANGUAGE sql;
Once we have deployed this procedure on all four nodes, we can give it a try:
SELECT create_user('hans', 'paul');
The PL/Proxy procedure in the test database will hash the input value and figure
out that the data has to be on p3, which is the fourth node:
p3=# SELECT * FROM t_user;
username | password
----------+----------
hans | paul
(1 row)
The following SQL statement will reveal why the fourth node is correct:
test=# SELECT hashtext('hans')::int4::bit(2)::int4;
hashtext
----------
3
(1 row)
Keep in mind that we will start counting at 0, so the fourth node is actually
number 3.
In our example, we executed a procedure on the proxy and relied on the fact that a
procedure with the same name will be executed on the slave. But what if you want
to call a procedure on a proxy that is supposed to execute some other procedure in
the desired node? For mapping a proxy procedure to some other procedure, there
is a command called TARGET.
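As a sketch, a proxy function using TARGET could look as follows (register_user
is a hypothetical procedure name that would have to exist on the data nodes):

CREATE OR REPLACE FUNCTION create_user(name text, pass text)
RETURNS void AS $$
    CLUSTER 'samplecluster';
    RUN ON hashtext($1);
    TARGET register_user;
$$ LANGUAGE plproxy;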
The best thing you can do is to create more partitions than needed straightaway.
So, if you consider getting started with four nodes or so, you should create 16
partitions straightaway and run four partitions per server. Extending your cluster
will be pretty easy in this case.
To replicate those existing nodes, you can simply use the technologies outlined
in this book, such as streaming replication, londiste, or Slony.
The main point here is: how can you tell PL/Proxy that a partition has moved
from one server to some other server?
ALTER SERVER samplecluster
OPTIONS (SET partition_0
'dbname=p4 host=localhost');
In this case, we have moved the first partition from p0 to p4. Of course, the partition
can also reside on some other host; PL/Proxy will not care which server it has to go
to to fetch the data. You just have to make sure that the target database has all the
tables in place and that PL/Proxy can reach this database.
Adding partitions is not difficult on the PL/Proxy side either. Just as before, you can
simply use ALTER SERVER to modify your partition list, as follows:
ALTER SERVER samplecluster
OPTIONS (
ADD partition_4 'dbname=p5 host=localhost',
ADD partition_5 'dbname=p6 host=localhost',
ADD partition_6 'dbname=p7 host=localhost',
ADD partition_7 'dbname=p8 host=localhost');
• You can make your partitioning function cleverer and make it treat the new
data differently from the old data. However, this can be error-prone and will
add a lot of complexity and legacy to your system.
• A cleverer way is to double the size of the cluster straightaway whenever
you grow it. The beauty of this is that you need just one more bit in your
hash key. If you move from four to, say, five nodes, there is usually no
way to grow the cluster without having to move a large amount of data
around, and rebalancing data is something you want to avoid at almost
any cost.
To protect yourself against these kinds of issues, you can always turn to streaming
replication and Linux HA. An architecture of PL/Proxy might look as follows:
Each node can have its own replica, and therefore, we can fail over each node
separately. The good thing about this infrastructure is that you can scale out your
reads, while improving availability at the same time.
The machines in the read-only area of the cluster will provide your system with
some extra performance.
One important thing about PL/Proxy is that you cannot simply use foreign keys out of
the box. Let's assume you have got two tables that are in a 1:n relationship. If you want
to partition the right side of the equation, you will also have to partition the other side.
Otherwise, the data you want to reference will simply end up in some other database.
An alternative would simply be to fully replicate the referenced tables.
In general, it has proven fairly beneficial to simply avoid foreign keys in this
kind of setup, because you need a fair amount of trickery to get them right.
A workaround is to use functions inside CHECK constraints, checking the child
tables for the corresponding value.
What you have to do is run CREATE EXTENSION with a target version and define
the version you want to upgrade from. All the rest will happen behind the
scenes automatically.
If you want to upgrade a database partition from one PostgreSQL version to the
next major release (minor releases will only need a simple restart of the node), it
will be a bit more complicated. When running PL/Proxy, it is safe to assume that
the amount of data in your system is so large that performing a simple dump/reload
is out of the question, because you would simply face far too much downtime.
Just as we have seen before, you can move a partition to a new server by calling
ALTER SERVER. In this way, you can replicate and upgrade one server after the
other and gradually move to a more recent PostgreSQL version in a risk-free and
downtime-free manner.
Summary
In this chapter, we discussed an important topic—PL/Proxy. The idea behind PL/
Proxy is to have a scalable solution to shard PostgreSQL databases. It has been
widely adopted by many companies around the globe and it allows you to scale
writes as well as reads.
Scaling with BDR
In this chapter, you will be introduced to a brand new technology called BDR.
Bidirectional Replication (BDR) is definitely one of the rising stars in the world of
PostgreSQL. A lot of new stuff will be seen in this area in the very near future, and
people can expect a thriving project.
Before digging into all the technical details, it is important to understand the
fundamental technical aspects of BDR.
BDR has been invented to put an end to trigger-based solutions and turn
PostgreSQL into a more robust, more scalable, and easier-to-administer
replication solution. Trigger-based replication is really a thing of the past and
should not be seen in a modern infrastructure anymore. You can safely bet on
BDR—it is a long-term, safe solution.
This definition is actually so good and simple that it makes sense to include it here.
The idea of eventual consistency is that data is not immediately the same on all
nodes, but after some time, it actually will be (if nothing happens).
Handling conflicts
Given the consistency model, an important topic pops up: what about conflicts? In
general, all systems using the eventually consistent model need some sort of conflict
resolution. The same applies to BDR.
The beauty of BDR is that conflict management is pretty flexible. By default, BDR
offers an unambiguous conflict resolution algorithm, which defines that the last
update always wins.
One more advantage of BDR is that conflicting changes can be logged in a table.
In other words, if a conflict happens, it is still possible to reverse engineer what
was really going on in the system. Data is not silently forgotten but preserved
for later investigation.
Chapter 14
When talking about conflicts, the use case of BDR has to be kept in mind; BDR has
been designed as a (geographically) distributed database system, which allows you
to handle a lot of data. From a consistency point of view, there is one thing that
really has to be kept in mind: how likely is it that a guy in New York and a guy in
Rome change the very same row at the very same time on their nodes? If this kind
of conflict is the standard case, BDR is really not the right solution. However, if
a conflict almost never happens (which is most likely the case in 99 percent of all
applications), BDR is really the way to go. Remember that a conflict can happen
only if the same row is changed by many people at the same time, or if a primary
key is violated. It does not happen if two people touch two rows that are completely
independent of each other. Eventual consistency is therefore a reasonable choice for
some use cases.
Distributing sequences
Sequences are a potential source of conflicts. Imagine hundreds of users adding
data to the very same table at the very same time. Any auto-increment column
will instantly turn out to be a real source of conflicts, as instances tend to assign
the same or similar numbers all over again.
Therefore, BDR offers support for distributed sequences. Each node is assigned not
just a single value but a range of values, which can then be used until the next block
of values is assigned. The availability of distributed sequences greatly reduces the
number of potential conflicts and helps your system run way more smoothly than
otherwise possible.
Handling DDLs
Data structures are by no means constant. Once in a while, it can happen that
structures have to be changed. Maybe a table has to be added, or a column has
to be dropped, and so on.
BDR can handle this nicely. Most DDLs are just replicated, as they are just sent to
all nodes and executed. However, there are also commands that are forbidden.
Here are two of the more prominent ones:
ALTER TABLE ... ALTER COLUMN ... USING();
ALTER TABLE ... ADD COLUMN ... DEFAULT;
Keep in mind that conflicts will be very likely anyway under concurrency, so not
having those commands is not too much of an issue. In most cases, the limitations
are not critical. However, they have to be kept in mind.
In many cases, there is a workaround for those commands, such as setting explicit
values, and other ways.
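For instance, instead of adding a column along with its default in a single step,
the same result can usually be achieved with a sequence of permitted commands
(the table and column names here are just examples):

ALTER TABLE t_data ADD COLUMN state int;
UPDATE t_data SET state = 0;
ALTER TABLE t_data ALTER COLUMN state SET DEFAULT 0;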
What does modifying in one node actually mean? Let's suppose there are three
locations: Vienna, Berlin, and London. Somebody working in Vienna is more likely
to modify Austrian data than a German working in Berlin. The German operator is
usually modifying data of German clients rather than data of Austrians or Britons.
The Austrian operator is most likely going to change Austrian data. Every operator
in every country should see all the data of the company. However, it is way more
likely that data is changed where it is created. From a business point of view, it is
basically impossible that two folks located in different countries change the same
data at the same time—conflicts are not likely.
Another bad scenario for BDR: if all the nodes need to see exactly the same data
at the very same time, BDR has a hard time.
BDR can write data quite efficiently. However, it is not able to scale writes infinitely
because at this point, all the writes still end up on every server. Remember that
scaling writes can only happen by actually splitting the data. So, if you are looking
for a solution that can scale out writes better, PL/Proxy might be a better choice.
The entire XLOG decoding thing basically goes on behind the scenes; end users
don't notice it.
Installing BDR
Installing BDR is easy. The software can be deployed directly from binary
packages. Of course, it is also possible to install the code from source. However,
the source installation procedure is likely to change as more and more of the code
moves into the PostgreSQL core, so I have decided to skip it here.
Once BDR has been installed on all the nodes, the system is ready for action.
Note that all the data nodes will be on the same physical box in order to make life
easier for beginners.
Arranging storage
The first thing administrators have to do is create some space for PostgreSQL.
In this simplistic example, just three directories are created:
[root@localhost ~]# mkdir /data
[root@localhost ~]# mkdir /data/node1 /data/node2 /data/node3
Make sure that these directories belong to postgres (if PostgreSQL is really
running as postgres, which is usually a good idea):
[root@localhost ~]# cd /data/
[root@localhost data]# chown postgres.postgres node*
Once these directories have been created, everything needed for a successful
setup is in place:
[root@localhost data]# ls -l
total 0
drwxr-xr-x 2 postgres postgres 6 Apr 11 05:51 node1
drwxr-xr-x 2 postgres postgres 6 Apr 11 05:51 node2
drwxr-xr-x 2 postgres postgres 6 Apr 11 05:51 node3
Then the three database instances can be created. The initdb command can be
used just as in the case of any ordinary instance:
[postgres@localhost ~]$ initdb -D /data/node1/ -A trust
[postgres@localhost ~]$ initdb -D /data/node2/ -A trust
[postgres@localhost ~]$ initdb -D /data/node3/ -A trust
To make the setup process more simplistic, trust will be used as the authentication
method. User authentication is, of course, possible but it is not the core topic of this
chapter, so it is better to simplify this part as much as possible.
Now that the three database instances have been created, it is possible to adjust
postgresql.conf. The following parameters are needed:
shared_preload_libraries = 'bdr'
wal_level = 'logical'
track_commit_timestamp = on
max_connections = 100
max_wal_senders = 10
max_replication_slots = 10
max_worker_processes = 10
The first thing to do is to actually load the BDR module into PostgreSQL. It contains
some important infrastructure for replication. In the next step, logical decoding has
to be enabled. It will be the backbone of the entire infrastructure.
Then max_wal_senders has to be set along with replication slots. These settings are
needed for streaming replication as well, and should not come as a big surprise.
Now that postgresql.conf has been configured, it is time to focus our attention on
pg_hba.conf. In the most simplistic case, simple replication rules have to be created.
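For a purely local test setup such as this one, entries along the following lines
are enough (they match the trust authentication chosen during initdb):

local   replication   postgres                  trust
host    replication   postgres   127.0.0.1/32   trust
host    replication   postgres   ::1/128        trust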
Note that in a real, productive setup, a wise administrator would configure a special
replication user and set a password, or use some other authentication method. For
simplicity reasons, this process has been left out here.
These extensions must be loaded into all three databases created before. It is not
enough to load them into just one component. It is vital to load them into all databases.
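Assuming standard BDR packages, which ship the btree_gist and bdr extensions,
loading them boils down to two commands per database:

test=# CREATE EXTENSION btree_gist;
test=# CREATE EXTENSION bdr;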
Finally, our database nodes have to be put into a BDR group. So far, there are only
three independent database instances, which happen to contain some modules. In
the next step, these nodes will be connected to each other.
Basically, two parameters are needed: the local name and a database connection to
connect to the node from a remote host. To define local_node_name, it is easiest to
just give the node a simple name.
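On the first node, the group is created rather than joined. A sketch of such a call
(the node name and connection string are examples) might look like this:

test=# SELECT bdr.bdr_group_create(
    local_node_name := 'node1',
    node_external_dsn := 'port=5432 dbname=test'
);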
To check whether the node is ready for BDR, call the following function. If the
answer is as follows, it means that things are fine:
test=# SELECT bdr.bdr_node_join_wait_for_ready();
bdr_node_join_wait_for_ready
------------------------------
(1 row)
Again, a NULL value is a good sign. First, the second node is added to BDR. Then the
third node can join as well:
test=# SELECT bdr.bdr_group_join(
local_node_name := 'node3',
node_external_dsn := 'port=5434 dbname=test',
join_using_dsn := 'port=5432 dbname=test'
);
bdr_group_join
----------------
(1 row)
Once all the nodes have been added, the administrator can check whether all of
them are ready:
[postgres@localhost node2]$ psql test -p 5433
test=# SELECT bdr.bdr_node_join_wait_for_ready();
bdr_node_join_wait_for_ready
------------------------------
(1 row)
If both queries return NULL, it means that the system is up and running nicely.
As you can see, each instance will have at least three BDR processes. If these processes
are around, it is usually a good sign and replication should work as expected.
The table looks as expected. There is just one exception: a TRUNCATE trigger has been
created automatically. Keep in mind that replication slots are able to stream INSERT,
UPDATE, and DELETE statements. DDLs and TRUNCATE are not row-level information,
so those statements are not in the stream. The trigger is needed to catch
TRUNCATE and replicate it as plain text. Do not attempt to alter or drop the trigger.
In this example, the value has been added to the instance listening on 5432. A quick
check reveals that the data has been nicely replicated to the instances listening on
5433 and 5434:
[postgres@localhost ~]$ psql test -p 5434
test=# TABLE t_test;
id | t
----+----------------------------
1 | 2015-04-11 08:48:46.637675
(1 row)
Handling conflicts
As already stated earlier in this chapter, conflicts are an important topic when
working with BDR. Keep in mind that BDR has been designed as a distributed
system, so it only makes sense to use it when conflicts are not very likely. However,
it is important to understand what happens in the event of a conflict.
To run the test, a simple SQL script is needed. In this example, it consists of
10,000 UPDATE statements:
[postgres@localhost ~]$ head -n 3 /tmp/script.sql
UPDATE t_counter SET id = id + 1;
UPDATE t_counter SET id = id + 1;
UPDATE t_counter SET id = id + 1;
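Only the first three lines of the script are shown above; the full file can be generated with a simple loop (a sketch, assuming that every one of the 10,000 lines is the identical statement shown):

```shell
# Generate the full script: 10,000 copies of the same UPDATE statement
# (assumption: all lines are identical to the three shown above)
for i in $(seq 1 10000); do
    echo "UPDATE t_counter SET id = id + 1;"
done > /tmp/script.sql
```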
The nasty part now is that this script is executed three times, once on each node:
[postgres@localhost ~]$ cat run.sh
#!/bin/sh
psql test -p 5432 < /tmp/script.sql > /dev/null &
psql test -p 5433 < /tmp/script.sql > /dev/null &
psql test -p 5434 < /tmp/script.sql > /dev/null &
As the same row is hit again and again, the number of conflicts is expected
to skyrocket.
Note that this is not what BDR has been built for in the first place. It is
just a demo meant to show what happens in the event of a conflict.
Once those three scripts have been completed, it is already possible to check out
what has happened on the conflict side:
test=# \x
Expanded display is on.
test=# TABLE bdr.bdr_conflict_history LIMIT 1;
-[ RECORD 1 ]------------+------------------------------
conflict_id | 1
local_node_sysid | 6136360318181427544
local_conflict_xid | 0
local_conflict_lsn | 0/19AAE00
local_conflict_time | 2015-04-11 09:01:23.367467+02
object_schema | public
object_name | t_counter
remote_node_sysid | 6136360353631754624
remote_txid | 1974
remote_commit_time | 2015-04-11 09:01:21.364068+02
remote_commit_lsn | 0/1986900
conflict_type | update_delete
conflict_resolution | skip_change
local_tuple |
remote_tuple | {"id":2}
local_tuple_xmin |
local_tuple_origin_sysid |
error_message |
error_sqlstate |
error_querystring |
error_cursorpos |
error_detail |
error_hint |
error_context |
error_columnname |
error_typename |
error_constraintname |
error_filename |
error_lineno |
error_funcname |
BDR provides a simple way to inspect all the conflicts, row after row, in this table.
On top of the LSN, the transaction ID and some more information relevant to the
conflict are shown. In this example, BDR has performed a skip_change resolution.
Remember that every change hits the very same row; in an asynchronous multi-master
setup, this is the worst case for BDR. The UPDATE statements in question have indeed
been skipped, and this is vital to understand: in the event of conflicts caused by
concurrency, BDR can skip changes in your cluster.
Understanding sets
So far, the entire cluster has been used: everybody was able to replicate data to
everybody else. In many cases, this is not what is desired, and BDR provides quite
a lot of flexibility in this area.
Unidirectional replication
BDR can perform not only bidirectional replication but also unidirectional
replication. In some cases, this can be very handy. Consider a system that is just
there to serve some reads. A simple unidirectional slave might be what you need.
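A sketch of such a setup follows. It assumes BDR 0.9's bdr.bdr_subscribe() function and connection strings in the style used earlier in this chapter; the node name and port 5435 are made up for this example:

```sql
-- A read-only node subscribes to one upstream node instead of joining
-- the whole group (function name and signature assumed from BDR 0.9;
-- node name and port are made up for this sketch)
SELECT bdr.bdr_subscribe(
    local_node_name := 'readonly1',
    subscribe_to_dsn := 'port=5432 dbname=test',
    node_local_dsn := 'port=5435 dbname=test'
);
```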
The setup process is fairly simple and fits nicely into BDR's fundamental
design principles.
Calling bdr.table_set_replication_sets sets the replication sets of a table; the
previous assignment will be overwritten. If you want to figure out which replication
sets a table belongs to, the following function can be called:
bdr.table_get_replication_sets(relation regclass) text[]
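The two functions can be used together; a short sketch (the set name is made up, and the setter's signature is assumed from BDR 0.9):

```sql
-- Assign t_test to a single replication set; the previous assignment
-- is overwritten (set name is made up, signature assumed from BDR 0.9)
SELECT bdr.table_set_replication_sets('t_test', ARRAY['important_tables']);

-- Read the assignment back
SELECT bdr.table_get_replication_sets('t_test');
```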
Controlling replication
For maintenance reasons, it might be necessary to pause and resume replication from
time to time. Just imagine a major software update: it might do something nasty to
your data structure, and you definitely don't want faulty stuff to be replicated to
your entire system. Therefore, it can come in handy to stop replication briefly and
restart it once things have proven to be fine.
Connections to the remote node (or nodes) are retained, but no data is read from
them. The effect of pausing apply is not persistent, so replay will resume if
PostgreSQL is restarted or the postmaster carries out crash recovery after a backend
crash. Terminating an individual backend using pg_terminate_backend will not cause
replay to resume, and neither will reloading the postmaster without a full restart.
There is no option to pause replay from only one peer node.
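A sketch of such a maintenance window, assuming BDR's bdr.bdr_apply_pause() and bdr.bdr_apply_resume() functions:

```sql
-- Pause replay on this node; the effect is not persistent across
-- a restart (function names assumed from BDR 0.9)
SELECT bdr.bdr_apply_pause();

-- ... run and verify the software update ...

-- Resume replay again
SELECT bdr.bdr_apply_resume();
```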
Summary
BDR is a rising star in the world of PostgreSQL replication. Currently, it is still
under heavy development, and we can expect a lot more in the very near future
(maybe even when you are holding this book in your hands).
Working with Walbouncer
In the final chapter of this book, you will be guided through a tool released in 2014,
called walbouncer. Most of the techniques covered in this book have explained how
to replicate an entire database instance, how to shard, and more. This final chapter
is all about filtering the stream of the transaction log in order to selectively
replicate database objects from one server to a set of (not necessarily identical)
slaves.
The trouble with streaming replication is that it always replicates the entire
database instance. In many real-world scenarios, this can be a problem. Let's assume
there is a central server containing information about students studying at many
universities. Each university should have a copy of its own data only. As of
PostgreSQL 9.4, this is not possible using a single database instance, because
streaming replication is only capable of fully replicating an instance. Running
many instances is clearly a lot more work and maybe not the desired methodology.
The idea behind walbouncer is to connect to the PostgreSQL transaction log and filter
it. In this scenario, a slave receives only a subset of the data, filtering out
everything that may be critical from a legal or security point of view. In the case
of our university example, each university would have a replica of only its own
database, and therefore there would be no way to see data from other organizations.
Hiding data is a huge step forward when it comes to securing systems. There might
also be a use case for walbouncer for the purpose of sharding.
The walbouncer tool is a process sitting between the master and the slaves. It connects
to the master, fetches the transaction log, and filters it before it is transmitted to the
slaves. In this example, two slaves are available for consuming the transaction log
just like regular slaves.
Chapter 15
Filtering XLOG
The core question now is: how is it possible for walbouncer to filter the transaction
log? Remember that transaction log positions are highly critical, and fiddling around
with them is anything but trivial: it is dangerous and, in many cases, not feasible
at all.
The solution to this problem lies deep in the core of PostgreSQL. The core knows how
to handle dummy transaction log entries. To cut out all of the data that is not destined
to reach a certain box, dummy XLOG is injected, replacing the original one. The slave
can now safely consume the XLOG and ignore those dummy records. The beauty of
this technique is in the fact that the master and slave can stay unchanged—no patching
is needed. All of the work can be done exclusively by walbouncer, which simply acts
as some sort of XLOG proxy.
• Each slave will receive the same number of XLOG records, regardless of
the amount of data that has actually changed in the target system
• Metadata (system tables) has to be replicated completely and must not be
left out
• The target system will still see that a certain database is supposed to exist,
but using it will simply fail
The last item especially deserves some attention. Remember that the system cannot
analyze the semantics of a certain XLOG record; all it does is check whether the
record is needed or not. Therefore, metadata has to be copied as it is. When a slave
system tries to read filtered data, it will receive a nasty error indicating that
data files are missing. If a query cannot read a file from disk, it will naturally
raise an error and roll back. This behavior might be confusing for some people, but
it is the only possible way to solve the underlying technical problem.
Installing walbouncer
The walbouncer tool can be downloaded freely from the Cybertec website
(http://www.cybertec.at/postgresql_produkte/walbouncer/) and
installed using the following steps:
1. For the purpose of this book, the following file has been used:
wget http://cybertec.at/download/walbouncer-0.9.0.tar.bz2
2. Once the package has been extracted, you can enter the directory. Before
calling make, it is important to check for missing libraries. Make sure that
support for YAML is around. On my CentOS test system, the following
command does the job:
[root@localhost ~]# yum install libyaml-devel
3. In the next step, just call make. The code compiles cleanly. Finally, only
make install is left.
As you can see, deploying walbouncer is easy, and all it takes is a few commands.
Configuring walbouncer
Once the code has been deployed successfully, it is necessary to come up with
a simple configuration to tell walbouncer what to do. To demonstrate how
walbouncer works, a simple setup has been created here. In this example, two
databases exist on the master. Only one of them should end up on the slave:
$ createdb a
$ createdb b
Before getting started with a base backup, it makes sense to come up with a
walbouncer config. A basic configuration is really simple:
listen_port: 5433
master:
  host: localhost
  port: 5432
configurations:
  - slave1:
      filter:
        include_databases: [a]
All of these steps have already been described in Chapter 4, Setting Up Asynchronous
Replication. As already mentioned, the tricky part is the initial base backup. Assuming
that the a database has to be replicated, it is necessary to find its object ID:
test=# SELECT oid, datname FROM pg_database WHERE datname = 'a';
oid | datname
-------+---------
24576 | a
(1 row)
In this example, the object ID is 24576. The general rule is as follows: all databases
with OIDs larger than 16383 are databases created by users. These are the only
databases that can be filtered in a useful way.
Now go to the data directory of the master and copy everything but the base directory
to the directory where your slave resides. In this example, this little trick is applied:
there are no filenames starting with a, so it is possible to safely copy everything
starting with c and add the backup label to the copy process. Now the system
will copy everything but the base directory:
cp -Rv [c-z]* backup_label ../slave/
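The effect of the glob can be verified in isolation with a few dummy files (a sketch using a scratch directory; the file names are just examples):

```shell
# Demonstrate that [c-z]* skips both 'base' and 'backup_label';
# this is why backup_label is named explicitly in the cp command above
mkdir -p /tmp/glob_demo/base /tmp/glob_demo/global /tmp/glob_demo/pg_xlog
touch /tmp/glob_demo/backup_label /tmp/glob_demo/postgresql.conf
cd /tmp/glob_demo
echo [c-z]*    # prints: global pg_xlog postgresql.conf
```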
Once everything has been copied to the slave, which happens to be on the same box
in this example, the missing base directory can be created in the slave directory:
$ mkdir ../slave/base
In the next step, all the databases that are needed can be copied over. Currently,
the situation on the sample master is as follows:
[hs@localhost base]$ ls -l
total 72
drwx------ 2 hs hs 8192 Feb 24 11:53 1
drwx------ 2 hs hs 8192 Feb 24 11:52 13051
drwx------ 2 hs hs 8192 Feb 25 11:49 13056
drwx------ 2 hs hs 8192 Feb 25 11:48 16384
drwx------ 2 hs hs 8192 Feb 25 10:32 24576
drwx------ 2 hs hs 8192 Feb 25 10:32 24577
All OIDs larger than 16383 are databases created by end users. In this scenario,
there are three such databases.
The important thing is that in most setups, the base directory is by far the largest
directory. Once the data has been secured, the backup can be stopped, as follows:
test=# SELECT pg_stop_backup();
NOTICE: WAL archiving is not enabled; you must ensure that all required
WAL segments are copied through other means to complete the backup
pg_stop_backup
----------------
0/2000238
(1 row)
The most important thing here is that the port of walbouncer has to be put into the
slave's config file: the slave won't connect to its master directly anymore, but will
consume all of its XLOG through walbouncer.
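In other words, the recovery.conf of the slave points at walbouncer rather than at the master. A minimal sketch, assuming walbouncer listens on port 5433 as configured above (the application_name is an assumption; it only matters once config sections are matched by name):

```
# recovery.conf on the slave: primary_conninfo points to walbouncer,
# not to the master (port 5433 as in the walbouncer config above)
standby_mode = 'on'
primary_conninfo = 'host=localhost port=5433 application_name=slave1'
```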
Firing up walbouncer
Once the configuration has been completed, it is possible to fire up walbouncer.
The syntax of walbouncer is pretty simple:
$ walbouncer --help
...
                        Default localhost
-P, --masterport=PORT   Connect to master on this port.
                        Default 5432
-p, --port=PORT         Run proxy on this port. Default 5433
-v, --verbose           Output additional debugging information
All of the relevant information is in the config file anyway, so a simple startup
is ensured:
$ walbouncer -v -c config.ini
[2015-02-25 11:56:57] wbsocket.c INFO: Starting socket on port 5433
The -v option is not mandatory. All it does is provide us with a bit more
information about what is going on. Once the "Starting socket" message shows
up, everything is working as expected.
Finally, it's time to start the slave server. Go to the slave's data directory and fire it
up, as follows:
slave]$ pg_ctl -D . -o "--port=5444" start
server starting
LOG: database system was shut down in recovery at 2015-02-25 11:59:12
CET
LOG: entering standby mode
LOG: redo starts at 0/2000060
LOG: record with zero length at 0/2000138
INFO: WAL stream is being filtered
DETAIL: Databases included: a
LOG: started streaming WAL from primary at 0/2000000 on timeline 1
LOG: consistent recovery state reached at 0/2000238
LOG: database system is ready to accept read only connections
The . here means the current directory (putting the full path is, of course, also a
good idea). There is one more trick: when syncing a slave over and over again for
testing purposes on the same box, it can be a bit annoying to change the port in the
config file every time. The -o option allows overriding settings from
postgresql.conf so that the system can be started directly on some other port.
As soon as the slave comes to life, walbouncer will start issuing more logging
information, telling us more about the streaming going on:
[2015-02-25 11:58:37] wbclientconn.c INFO: Received conn from
0100007F:54729
[2015-02-25 11:58:37] wbclientconn.c DEBUG1: Sending authentication
packet
[2015-02-25 11:58:37] wbsocket.c DEBUG1: Conn: Sending to client 9 bytes
of data
[2015-02-25 11:58:37] wbclientconn.c INFO: Start connecting to
host=localhost port=5432 user=hs dbname=replication replication=true
application_name=walbouncer
FATAL: no pg_hba.conf entry for replication connection from host "::1",
user "hs"
The FATAL message indicates that the master's pg_hba.conf does not yet allow
replication connections. Once it has been fixed and these messages show up without
errors, the system is up and running and the transaction log is flowing from the
master to the slave.
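Suitable pg_hba.conf entries on the master look along these lines (a sketch; trust authentication is acceptable only for a local test setup):

```
# pg_hba.conf on the master: allow replication connections for user hs
host    replication    hs    127.0.0.1/32    trust
host    replication    hs    ::1/128         trust
```

A reload (pg_ctl reload) on the master makes the change effective.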
The next test checks whether the data is nicely replicated from the master to the
slave. To perform the check, a table can be created on the master:
a=# CREATE TABLE a (aid int);
CREATE TABLE
The system is now ready for action and can be used safely.
The first example shows what can be done if more than one slave is around:
listen_port: 5433
master:
  host: localhost
  port: 5432
configurations:
  - slave1:
      match:
        application_name: slave1
      filter:
        include_tablespaces: [spc_slave1]
        exclude_databases: [test]
Inside the configurations block, there is the slave1 section. The slave1
configuration will be chosen if a slave connects using slave1 as its
application_name (as outlined in the match clause). If this configuration is chosen
by the server (there can be many such slave sections), the filters listed in the
next block are applied.
Basically, each type of filter comes in two incarnations: include_ and exclude_. In
this example, only the spc_slave1 tablespace is included. The final setting says that
only test is excluded (all other databases are included if the tablespace filter
matches them).
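The filters can also be reversed. A sketch of such a configuration section (the slave2 name and the match block are assumptions; only the exclude_ and include_ filters follow from the description below):

```yaml
configurations:
  - slave2:
      match:
        application_name: slave2
      filter:
        exclude_tablespaces: [spc_slave1]
        include_databases: [test]
```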
In this case, all the tablespaces but spc_slave1 are included. Only system databases
and the database test are replicated. Given those include_ and exclude_ settings,
it is possible to flexibly configure what to replicate to which slave.
As you can see, application_name really serves two purposes here: it decides which
config block to use, and tells the master which level of replication is required.
• pg_class: This table contains a list of objects (tables, indexes, and so on).
It is important to fetch the object ID from this table.
• pg_namespace: This is used to fetch information about schemas.
• pg_inherits: This contains information about table inheritance.
There is no general guideline on how to find all of those objects, because things
depend highly on the types of filters applied.
The simplest way to come up with the proper SQL queries to find those objects
(files) that have to be deleted is to use the -E option of psql. If psql is
started with -E, it shows the SQL code behind all of those backslash commands,
which can come in handy here. Here is some example output:
test=# \q
hs@chantal:~$ psql test -E
psql (9.4.1)
Type "help" for help.
test=# \d
********* QUERY **********
SELECT n.nspname as "Schema",
c.relname as "Name",
CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN
'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN
's' THEN 'special' WHEN 'f' THEN 'foreign table' END as "Type",
pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','v','m','S','f','')
AND n.nspname <> 'pg_catalog'
AND n.nspname <> 'information_schema'
AND n.nspname !~ '^pg_toast'
AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1,2;
**************************
List of relations
Schema | Name | Type | Owner
--------+--------------------+-------+-------
...
Once the files have been deleted from the base directory, walbouncer and the slave
instance can be restarted. Keep in mind that walbouncer is a tool used to make
normal streaming more powerful. Therefore, slaves are still read-only, and it is not
possible to use commands such as DELETE and DROP. You really have to delete files
from disk.
Simply use the mechanism described earlier in this chapter to avoid all pitfalls.
Summary
In this chapter, the working of walbouncer, a tool used to filter transaction logs,
was discussed. In addition to the installation process, all the configuration options
and a basic setup were outlined.