
Commit 0c8eda6

Memory barrier support for PostgreSQL.
This is not actually used anywhere yet, but it gets the basic infrastructure in place. It is fairly likely that there are bugs, and support for some important platforms may be missing, so we'll need to refine this as we go along.
1 parent: 291873c

3 files changed, 371 insertions(+), 0 deletions(-)
src/backend/storage/lmgr/README.barrier

+199
@@ -0,0 +1,199 @@
Memory Barriers
===============

Modern CPUs make extensive use of pipelining and out-of-order execution,
meaning that the CPU is often executing more than one instruction at a
time, and not necessarily in the order that the source code would suggest.
Furthermore, even before the CPU gets a chance to reorder operations, the
compiler may (and often does) reorganize the code for greater efficiency,
particularly at higher optimization levels. Optimizing compilers and
out-of-order execution are both critical for good performance, but they
can lead to surprising results when multiple processes access the same
memory space.

Example
=======

Suppose x is a pointer to a structure stored in shared memory, and that the
entire structure has been initialized to zero bytes. One backend executes
the following code fragment:

	x->foo = 1;
	x->bar = 1;

Meanwhile, at approximately the same time, another backend executes this
code fragment:

	bar = x->bar;
	foo = x->foo;

The second backend might end up with foo = 1 and bar = 1 (if it executes
both statements after the first backend), or with foo = 0 and bar = 0 (if
it executes both statements before the first backend), or with foo = 1 and
bar = 0 (if the first backend executes the first statement, the second
backend executes both statements, and then the first backend executes the
second statement).

Surprisingly, however, the second backend could also end up with foo = 0
and bar = 1. The compiler might swap the order of the two stores performed
by the first backend, or the two loads performed by the second backend.
Even if it doesn't, on a machine with weak memory ordering (such as PowerPC
or Itanium) the CPU might choose to execute either the loads or the stores
out of order. This surprising result can lead to bugs.
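
For the skeptical, the anomaly can sometimes be demonstrated with a small
stand-alone program. The following sketch is hypothetical and not part of
this commit: it uses POSIX threads in place of backends sharing memory, and
counts how often the "impossible" outcome appears. On x86 it should never
fire (loads are not reordered with other loads, nor stores with other
stores); on a weakly ordered machine it may, though possibly only after
many iterations.

	/* litmus.c -- hypothetical demo, not part of PostgreSQL.
	 * Build with: cc -std=c99 -O2 -pthread litmus.c
	 */
	#include <pthread.h>
	#include <stdio.h>

	static volatile int foo, bar;	/* stand-ins for x->foo and x->bar */

	static void *
	writer(void *arg)
	{
		(void) arg;
		foo = 1;				/* the first backend's two stores */
		bar = 1;
		return NULL;
	}

	int
	main(void)
	{
		int		anomalies = 0;

		for (int i = 0; i < 100000; i++)
		{
			pthread_t	t;
			int			my_foo,
						my_bar;

			foo = bar = 0;
			pthread_create(&t, NULL, writer, NULL);
			my_bar = bar;		/* the second backend's two loads */
			my_foo = foo;
			pthread_join(t, NULL);
			if (my_foo == 0 && my_bar == 1)
				anomalies++;
		}
		printf("saw foo = 0 and bar = 1 %d times\n", anomalies);
		return 0;
	}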

A common pattern where this actually does result in a bug is when adding
items onto a queue. The writer does this:

	q->items[q->num_items] = new_item;
	++q->num_items;

The reader does this:

	num_items = q->num_items;
	for (i = 0; i < num_items; ++i)
		/* do something with q->items[i] */

This code turns out to be unsafe, because the writer might increment
q->num_items before it finishes storing the new item into the appropriate
slot. More subtly, the reader might prefetch the contents of the q->items
array before reading q->num_items. Thus, there's still a bug here *even if
the writer does everything in the order we expect*. We need the writer to
update the array before bumping the item counter, and the reader to examine
the item counter before examining the array.

Note that these types of highly counterintuitive bugs can *only* occur when
multiple processes are interacting with the same memory segment. A given
process always perceives its *own* writes to memory in program order.

Avoiding Memory Ordering Bugs
=============================

The simplest (and often best) way to avoid memory ordering bugs is to
protect the data structures involved with an lwlock. For more details, see
src/backend/storage/lmgr/README. For instance, in the above example, the
writer could acquire an lwlock in exclusive mode before appending to the
queue, and each reader could acquire the same lock in shared mode before
reading it. If the data structure is not heavily trafficked, this solution
is generally entirely adequate.

However, in some cases, it is desirable to avoid the overhead of acquiring
and releasing locks. In this case, memory barriers may be used to ensure
that the apparent order of execution is as the programmer desires. In
PostgreSQL backend code, the pg_memory_barrier() macro may be used to
achieve this result. In the example above, we can prevent the reader from
seeing a garbage value by having the writer do this:

	q->items[q->num_items] = new_item;
	pg_memory_barrier();
	++q->num_items;

And by having the reader do this:

	num_items = q->num_items;
	pg_memory_barrier();
	for (i = 0; i < num_items; ++i)
		/* do something with q->items[i] */

The pg_memory_barrier() macro will (1) prevent the compiler from rearranging
the code in such a way as to allow the memory accesses to occur out of order
and (2) generate any code (often, inline assembly) that is needed to prevent
the CPU from executing the memory accesses out of order. Specifically, the
barrier prevents loads and stores written after the barrier from being
performed before the barrier, and vice-versa.

Although this code will work, it is needlessly inefficient. On systems with
strong memory ordering (such as x86), the CPU never reorders loads with
other loads, nor stores with other stores. It can, however, allow a load to
be performed before a subsequent store. To avoid emitting unnecessary memory
instructions, we provide two additional primitives: pg_read_barrier() and
pg_write_barrier(). When a memory barrier is being used to separate two
loads, use pg_read_barrier(); when it is separating two stores, use
pg_write_barrier(); when it is separating a load and a store (in either
order), use pg_memory_barrier(). pg_memory_barrier() can always substitute
for either a read or a write barrier, but is typically more expensive, and
therefore should be used only when needed.

With these guidelines in mind, the writer can do this:

	q->items[q->num_items] = new_item;
	pg_write_barrier();
	++q->num_items;

And the reader can do this:

	num_items = q->num_items;
	pg_read_barrier();
	for (i = 0; i < num_items; ++i)
		/* do something with q->items[i] */

On machines with strong memory ordering, these weaker barriers will simply
prevent compiler rearrangement, without emitting any actual machine code.
On machines with weak memory ordering, they will prevent compiler
reordering and also emit whatever hardware barrier may be required. Even
on machines with weak memory ordering, a read or write barrier may be able
to use a less expensive instruction than a full barrier.
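
To experiment with this pattern outside the backend, the barrier macros can
be stubbed with GCC's __sync_synchronize() built-in, which is a full barrier
and therefore strictly stronger than necessary. The following is a
hypothetical sketch only; the real, cheaper definitions live in
src/include/storage/barrier.h:

	/* Hypothetical stand-alone sketch; not how the backend defines these. */
	#include <stdio.h>

	#define pg_write_barrier()	__sync_synchronize()
	#define pg_read_barrier()	__sync_synchronize()

	#define QUEUE_MAX 128

	struct queue
	{
		volatile int	num_items;
		int				items[QUEUE_MAX];
	};

	/* single writer only; see "Weaknesses" below */
	static void
	enqueue(struct queue *q, int new_item)
	{
		q->items[q->num_items] = new_item;
		pg_write_barrier();		/* array store before counter bump */
		++q->num_items;
	}

	static void
	read_all(struct queue *q)
	{
		int		num_items = q->num_items;

		pg_read_barrier();		/* counter load before array loads */
		for (int i = 0; i < num_items; ++i)
			printf("%d\n", q->items[i]);
	}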

Weaknesses of Memory Barriers
=============================

While memory barriers are a powerful tool, and much cheaper than locks, they
are also much less capable than locks. Here are some of the problems.

1. Concurrent writers are unsafe. In the above example of a queue, using
memory barriers doesn't make it safe for two processes to add items to the
same queue at the same time. If more than one process can write to the
queue, a spinlock or lwlock must be used to synchronize access. The readers
can perhaps proceed without any lock, but the writers may not.

Even very simple write operations often require additional synchronization.
For example, it's not safe for multiple writers to simultaneously execute
this code (supposing x is a pointer into shared memory):

	x->foo++;

Although this may compile down to a single machine-language instruction,
the CPU will execute that instruction by reading the current value of foo,
adding one to it, and then storing the result back to the original address.
If two CPUs try to do this simultaneously, both may do their reads before
either one does their writes. Eventually we might be able to use an atomic
fetch-and-add instruction for this specific case on architectures that
support it, but we can't rely on that being available everywhere, and we
currently have no support for it at all. Use a lock.
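
For the curious, here is what such an atomic fetch-and-add might look like
using GCC's __sync_fetch_and_add built-in (available in GCC 4.1 and later).
This is purely a hypothetical sketch; as noted above, PostgreSQL provides no
such primitive at this point, and this built-in is not portable to every
supported compiler:

	/* Hypothetical sketch only -- not provided by this commit. */
	static inline int
	atomic_increment(volatile int *ptr)
	{
		/* atomically does { old = *ptr; *ptr = old + 1; return old; } */
		return __sync_fetch_and_add(ptr, 1);
	}

	/* With such a primitive, multiple writers could safely do: */
	atomic_increment(&x->foo);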

2. Eight-byte loads and stores aren't necessarily atomic. We assume in
various places in the source code that an aligned four-byte load or store is
atomic, and that other processes therefore won't see a half-set value.
Sadly, the same can't be said for eight-byte values: on some platforms, an
aligned eight-byte load or store will generate two four-byte operations. If
you need an atomic eight-byte read or write, you must make it atomic with a
lock.
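
A sketch of that workaround, using the backend's existing spinlock
primitives from storage/spin.h (the wrapper struct and function here are
hypothetical, invented for illustration):

	#include "storage/spin.h"

	/* hypothetical wrapper: an eight-byte value made atomic by a spinlock;
	   SpinLockInit(&p->mutex) must be called once at setup time */
	typedef struct
	{
		slock_t		mutex;		/* protects value */
		uint64		value;
	} ProtectedUint64;

	static uint64
	read_protected_uint64(volatile ProtectedUint64 *p)
	{
		uint64		v;

		SpinLockAcquire(&p->mutex);
		v = p->value;
		SpinLockRelease(&p->mutex);
		return v;
	}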

3. No ordering guarantees. While memory barriers ensure that any given
process performs loads and stores to shared memory in order, they don't
guarantee synchronization. In the queue example above, we can use memory
barriers to be sure that readers won't see garbage, but there's nothing to
say whether a given reader will run before or after a given writer. If this
matters in a given situation, some other mechanism must be used instead of
or in addition to memory barriers.

4. Barrier proliferation. Many algorithms that at first seem appealing
require multiple barriers. If the number of barriers required is more than
one or two, you may be better off just using a lock. Keep in mind that, on
some platforms, a barrier may be implemented by acquiring and releasing a
backend-private spinlock. This may be better than a centralized lock under
contention, but it may also be slower in the uncontended case.

Further Reading
===============

Much of the documentation about memory barriers appears to be quite
Linux-specific. The following papers may be helpful:

Memory Ordering in Modern Microprocessors, by Paul E. McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf

Memory Barriers: a Hardware View for Software Hackers, by Paul E. McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

The Linux kernel also has some useful documentation on this topic. Start
with Documentation/memory-barriers.txt

src/backend/storage/lmgr/s_lock.c

+1
@@ -20,6 +20,7 @@
 #include "storage/s_lock.h"
 
+slock_t		dummy_spinlock;
 
 static int	spins_per_delay = DEFAULT_SPINS_PER_DELAY;

src/include/storage/barrier.h

+171
@@ -0,0 +1,171 @@
/*-------------------------------------------------------------------------
 *
 * barrier.h
 *	  Memory barrier operations.
 *
 * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/storage/barrier.h
 *
 *-------------------------------------------------------------------------
 */
#ifndef BARRIER_H
#define BARRIER_H

#include "storage/s_lock.h"

extern slock_t dummy_spinlock;

/*
 * A compiler barrier need not (and preferably should not) emit any actual
 * machine code, but must act as an optimization fence: the compiler must not
 * reorder loads or stores to main memory around the barrier.  However, the
 * CPU may still reorder loads or stores at runtime, if the architecture's
 * memory model permits this.
 *
 * A memory barrier must act as a compiler barrier, and in addition must
 * guarantee that all loads and stores issued prior to the barrier are
 * completed before any loads or stores issued after the barrier.  Unless
 * loads and stores are totally ordered (which is not the case on most
 * architectures) this requires issuing some sort of memory fencing
 * instruction.
 *
 * A read barrier must act as a compiler barrier, and in addition must
 * guarantee that any loads issued prior to the barrier are completed before
 * any loads issued after the barrier.  Similarly, a write barrier acts
 * as a compiler barrier, and also orders stores.  Read and write barriers
 * are thus weaker than a full memory barrier, but stronger than a compiler
 * barrier.  In practice, on machines with strong memory ordering, read and
 * write barriers may require nothing more than a compiler barrier.
 *
 * For an introduction to using memory barriers within the PostgreSQL backend,
 * see src/backend/storage/lmgr/README.barrier
 */

#if defined(DISABLE_BARRIERS)

/*
 * Fall through to the spinlock-based implementation.
 */

#elif defined(__INTEL_COMPILER)

/*
 * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
 */
#define pg_memory_barrier()		_mm_mfence()
#define pg_compiler_barrier()	__memory_barrier()

#elif defined(__GNUC__)

/* This works on any architecture, since it's only talking to GCC itself. */
#define pg_compiler_barrier()	__asm__ __volatile__("" : : : "memory")

#if defined(__i386__) || defined(__x86_64__)	/* 32 or 64 bit x86 */

/*
 * x86 and x86_64 do not allow loads to be reordered with other loads, or
 * stores to be reordered with other stores, but a load can be performed
 * before a subsequent store.
 *
 * "lock; addl" has worked for longer than "mfence".
 *
 * Technically, some x86-ish chips support uncached memory access and/or
 * special instructions that are weakly ordered.  In those cases we'd need
 * the read and write barriers to be lfence and sfence.  But since we don't
 * do those things, a compiler barrier should be enough.
 */
#define pg_memory_barrier() \
	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
#define pg_read_barrier()	pg_compiler_barrier()
#define pg_write_barrier()	pg_compiler_barrier()

#elif defined(__ia64__) || defined(__ia64)

/*
 * Itanium is weakly ordered, so read and write barriers require a full
 * fence.
 */
#define pg_memory_barrier()		__asm__ __volatile__ ("mf" : : : "memory")

#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)

/*
 * lwsync orders loads with respect to each other, and similarly with stores.
 * But a load can be performed before a subsequent store, so sync must be used
 * for a full memory barrier.
 */
#define pg_memory_barrier()		__asm__ __volatile__ ("sync" : : : "memory")
#define pg_read_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
#define pg_write_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")

#elif defined(__alpha) || defined(__alpha__)	/* Alpha */

/*
 * Unlike all other known architectures, Alpha allows dependent reads to be
 * reordered, but we don't currently find it necessary to provide a
 * conditional read barrier to cover that case.  We might need to add that
 * later.
 */
#define pg_memory_barrier()		__asm__ __volatile__ ("mb" : : : "memory")
#define pg_read_barrier()		__asm__ __volatile__ ("rmb" : : : "memory")
#define pg_write_barrier()		__asm__ __volatile__ ("wmb" : : : "memory")

#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)

/*
 * If we're on GCC 4.1.0 or higher, we should be able to get a memory
 * barrier out of this compiler built-in.  But we prefer to rely on our
 * own definitions where possible, and use this only as a fallback.
 */
#define pg_memory_barrier()		__sync_synchronize()

#endif

#elif defined(__ia64__) || defined(__ia64)

#define pg_compiler_barrier()	_Asm_sched_fence()
#define pg_memory_barrier()		_Asm_mf()

#elif defined(WIN32_ONLY_COMPILER)

/* Should work on both MSVC and Borland. */
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier()	_ReadWriteBarrier()
#define pg_memory_barrier()		MemoryBarrier()

#endif

/*
 * If we have no memory barrier implementation for this architecture, we
 * fall back to acquiring and releasing a spinlock.  This might, in turn,
 * fall back to the semaphore-based spinlock implementation, which will be
 * amazingly slow.
 *
 * It's not self-evident that every possible legal implementation of a
 * spinlock acquire-and-release would be equivalent to a full memory barrier.
 * For example, I'm not sure that Itanium's acq and rel add up to a full
 * fence.  But all of our actual implementations seem OK in this regard.
 */
#if !defined(pg_memory_barrier)
#define pg_memory_barrier() \
	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)
#endif

/*
 * If read or write barriers are undefined, we upgrade them to full memory
 * barriers.
 *
 * If a compiler barrier is unavailable, you probably don't want a full
 * memory barrier instead, so if you have a use case for a compiler barrier,
 * you'd better use #ifdef.
 */
#if !defined(pg_read_barrier)
#define pg_read_barrier()		pg_memory_barrier()
#endif
#if !defined(pg_write_barrier)
#define pg_write_barrier()		pg_memory_barrier()
#endif

#endif   /* BARRIER_H */
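
As a usage illustration, here is a hypothetical flag-passing idiom built on
these macros (a sketch assuming a backend source file with barrier.h on the
include path; publish() and consume() are invented names). The write barrier
makes the data store visible before the flag store, and the read barrier
keeps the consumer from loading data before it has seen the flag:

	#include "storage/barrier.h"

	/* hypothetical example: one-way message passing in shared memory */
	volatile int	flag;
	int				data;

	void
	publish(int value)
	{
		data = value;
		pg_write_barrier();		/* data store must precede flag store */
		flag = 1;
	}

	int
	consume(void)
	{
		if (flag)
		{
			pg_read_barrier();	/* flag load must precede data load */
			return data;
		}
		return -1;				/* nothing published yet */
	}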
