Memory Barriers
===============

Modern CPUs make extensive use of pipelining and out-of-order execution,
meaning that the CPU is often executing more than one instruction at a
time, and not necessarily in the order that the source code would suggest.
Furthermore, even before the CPU gets a chance to reorder operations, the
compiler may (and often does) reorganize the code for greater efficiency,
particularly at higher optimization levels. Optimizing compilers and
out-of-order execution are both critical for good performance, but they
can lead to surprising results when multiple processes access the same
memory space.
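
As a standalone illustration of the compiler's latitude (the variable and
function names here are hypothetical, not taken from the PostgreSQL
sources), consider two stores to independent globals. Reordering them
cannot change single-threaded behavior, so the compiler is free to emit
them in either order, particularly at higher optimization levels:

    int a = 0;
    int b = 0;

    void
    set_flags(void)
    {
        a = 1;      /* the compiler may emit this store second... */
        b = 1;      /* ...and this one first; either order is legal */
    }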

Example
=======

Suppose x is a pointer to a structure stored in shared memory, and that the
entire structure has been initialized to zero bytes. One backend executes
the following code fragment:

    x->foo = 1;
    x->bar = 1;

Meanwhile, at approximately the same time, another backend executes this
code fragment:

    bar = x->bar;
    foo = x->foo;

The second backend might end up with foo = 1 and bar = 1 (if it executes
both statements after the first backend), or with foo = 0 and bar = 0 (if
it executes both statements before the first backend), or with foo = 1 and
bar = 0 (if the first backend executes the first statement, the second
backend executes both statements, and then the first backend executes the
second statement).

Surprisingly, however, the second backend could also end up with foo = 0
and bar = 1. The compiler might swap the order of the two stores performed
by the first backend, or the two loads performed by the second backend.
Even if it doesn't, on a machine with weak memory ordering (such as PowerPC
or Itanium) the CPU might choose to execute either the loads or the stores
out of order. This surprising result can lead to bugs.

A common pattern where this actually does result in a bug is when adding items
onto a queue. The writer does this:

    q->items[q->num_items] = new_item;
    ++q->num_items;

The reader does this:

    num_items = q->num_items;
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

This code turns out to be unsafe, because the writer might increment
q->num_items before it finishes storing the new item into the appropriate slot.
More subtly, the reader might prefetch the contents of the q->items array
before reading q->num_items. Thus, there's still a bug here *even if the
writer does everything in the order we expect*. We need the writer to update
the array before bumping the item counter, and the reader to examine the item
counter before examining the array.

Note that these types of highly counterintuitive bugs can *only* occur when
multiple processes are interacting with the same memory segment. A given
process always perceives its *own* writes to memory in program order.

Avoiding Memory Ordering Bugs
=============================

The simplest (and often best) way to avoid memory ordering bugs is to
protect the data structures involved with an lwlock. For more details, see
src/backend/storage/lmgr/README. For instance, in the above example, the
writer could acquire an lwlock in exclusive mode before appending to the
queue, and each reader could acquire the same lock in shared mode before
reading it. If the data structure is not heavily trafficked, this solution is
generally entirely adequate.
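
A minimal sketch of that approach (assuming a queue_lock LWLock has been
set up in shared memory; the lock name is hypothetical):

    /* writer */
    LWLockAcquire(queue_lock, LW_EXCLUSIVE);
    q->items[q->num_items] = new_item;
    ++q->num_items;
    LWLockRelease(queue_lock);

    /* reader */
    LWLockAcquire(queue_lock, LW_SHARED);
    for (i = 0; i < q->num_items; ++i)
    {
        /* do something with q->items[i] */
    }
    LWLockRelease(queue_lock);

Because the lock fully serializes the readers against the writer, no
explicit memory barriers are needed here; the lock operations themselves
provide the required ordering.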

However, in some cases, it is desirable to avoid the overhead of acquiring
and releasing locks. In such cases, memory barriers may be used to ensure
that the apparent order of execution is as the programmer desires. In
PostgreSQL backend code, the pg_memory_barrier() macro may be used to achieve
this result. In the example above, we can prevent the reader from seeing a
garbage value by having the writer do this:

    q->items[q->num_items] = new_item;
    pg_memory_barrier();
    ++q->num_items;

And by having the reader do this:

    num_items = q->num_items;
    pg_memory_barrier();
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

The pg_memory_barrier() macro will (1) prevent the compiler from rearranging
the code in such a way as to allow the memory accesses to occur out of order
and (2) generate any code (often, inline assembly) that is needed to prevent
the CPU from executing the memory accesses out of order. Specifically, the
barrier prevents loads and stores written after the barrier from being
performed before the barrier, and vice-versa.

Although this code will work, it is needlessly inefficient. On systems with
strong memory ordering (such as x86), the CPU never reorders loads with other
loads, nor stores with other stores. It can, however, allow a load to be
performed before a subsequent store. To avoid emitting unnecessary memory
instructions, we provide two additional primitives: pg_read_barrier() and
pg_write_barrier(). When a memory barrier is being used to separate two
loads, use pg_read_barrier(); when it is separating two stores, use
pg_write_barrier(); when it is separating a load and a store (in either
order), use pg_memory_barrier(). pg_memory_barrier() can always substitute
for either a read or a write barrier, but is typically more expensive, and
therefore should be used only when needed.

With these guidelines in mind, the writer can do this:

    q->items[q->num_items] = new_item;
    pg_write_barrier();
    ++q->num_items;

And the reader can do this:

    num_items = q->num_items;
    pg_read_barrier();
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

On machines with strong memory ordering, these weaker barriers will simply
prevent compiler rearrangement, without emitting any actual machine code.
On machines with weak memory ordering, they will prevent compiler
reordering and also emit whatever hardware barrier may be required. Even
on machines with weak memory ordering, a read or write barrier may be able
to use a less expensive instruction than a full barrier.

Weaknesses of Memory Barriers
=============================

While memory barriers are a powerful tool, and much cheaper than locks, they
are also much less capable than locks. Here are some of the problems.

1. Concurrent writers are unsafe. In the above example of a queue, using
memory barriers doesn't make it safe for two processes to add items to the
same queue at the same time. If more than one process can write to the queue,
a spinlock or lwlock must be used to synchronize access. The readers can
perhaps proceed without any lock, but the writers may not.
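
For example, the writers might serialize their appends with a spinlock
while the readers continue to rely on the read barrier alone. A sketch,
assuming a hypothetical slock_t field in the queue structure:

    /* each writer */
    SpinLockAcquire(&q->mutex);     /* q->mutex is an slock_t */
    q->items[q->num_items] = new_item;
    pg_write_barrier();
    ++q->num_items;
    SpinLockRelease(&q->mutex);

Note that the write barrier is still required: the spinlock keeps the
writers from trampling on each other, but it does nothing to order the
stores as seen by readers that don't take the lock.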

Even very simple write operations often require additional synchronization.
For example, it's not safe for multiple writers to simultaneously execute
this code (supposing x is a pointer into shared memory):

    x->foo++;

Although this may compile down to a single machine-language instruction,
the CPU will execute that instruction by reading the current value of foo,
adding one to it, and then storing the result back to the original address.
If two CPUs try to do this simultaneously, both may do their reads before
either one does their writes, and one of the two increments will be lost.
Eventually we might be able to use an atomic fetch-and-add instruction for
this specific case on architectures that support it, but we can't rely on
that being available everywhere, and we currently have no support for it at
all. Use a lock.
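
A minimal sketch of the locked version, again assuming a hypothetical
slock_t field protecting the structure:

    SpinLockAcquire(&x->mutex);     /* x->mutex is an slock_t */
    x->foo++;
    SpinLockRelease(&x->mutex);

With the spinlock held, the read-modify-write cycle executes as a unit, so
no increment can be lost.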

2. Eight-byte loads and stores aren't necessarily atomic. We assume in
various places in the source code that an aligned four-byte load or store is
atomic, and that other processes therefore won't see a half-set value.
Sadly, the same can't be said for eight-byte values: on some platforms, an
aligned eight-byte load or store will generate two four-byte operations. If
you need an atomic eight-byte read or write, you must make it atomic with a
lock.
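
For instance, to read a shared 64-bit counter safely on such platforms
(the field names here are hypothetical):

    uint64 value;

    SpinLockAcquire(&shared->mutex);    /* also taken by anyone updating counter64 */
    value = shared->counter64;
    SpinLockRelease(&shared->mutex);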

3. No ordering guarantees. While memory barriers ensure that any given
process performs loads and stores to shared memory in order, they don't
guarantee synchronization. In the queue example above, we can use memory
barriers to be sure that readers won't see garbage, but there's nothing to
say whether a given reader will run before or after a given writer. If this
matters in a given situation, some other mechanism must be used instead of
or in addition to memory barriers.

4. Barrier proliferation. Many algorithms that at first seem appealing
require multiple barriers. If the number of barriers required is more than
one or two, you may be better off just using a lock. Keep in mind that, on
some platforms, a barrier may be implemented by acquiring and releasing a
backend-private spinlock. This may be better than a centralized lock under
contention, but it may also be slower in the uncontended case.
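
To make that cost concrete, such a fallback might look roughly like this
(a sketch only, not necessarily the implementation on any given platform):

    /* full barrier via a backend-private, never-contended spinlock */
    extern slock_t dummy_spinlock;

    #define pg_memory_barrier() \
        do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)

The lock is never contended, but each barrier still pays for the atomic
operation that S_LOCK performs.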

Further Reading
===============

Much of the documentation about memory barriers appears to be quite
Linux-specific. The following papers may be helpful:

Memory Ordering in Modern Microprocessors, by Paul E. McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf

Memory Barriers: a Hardware View for Software Hackers, by Paul E. McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

The Linux kernel also has some useful documentation on this topic. Start
with Documentation/memory-barriers.txt.