
Synchronization Primitives


Several synchronization primitives have been introduced to aid in
multithreading the kernel. These primitives are implemented by atomic
operations and use appropriate memory barriers so that users of these
primitives do not have to worry about doing it themselves. The primitives are
very similar to those used in other operating systems including mutexes,
condition variables, shared/exclusive locks, and semaphores.

Mutexes
The mutex primitive provides mutual exclusion for one or more data objects.
Two versions of the mutex primitive are provided: spin mutexes and sleep
mutexes.

Spin mutexes are a simple spin lock. If the lock is held by another thread when
a thread tries to acquire it, the second thread will spin waiting for the lock to
be released. Due to this spinning nature, a context switch cannot be performed
while holding a spin mutex to avoid deadlocking in the case of a thread owning
a spin lock not being executed on a CPU and all other CPUs spinning on that
lock. An exception to this is the scheduler lock, which must be held during a
context switch. As a special case, the ownership of the scheduler lock is passed
from the thread being switched out to the thread being switched in to satisfy
this requirement while still protecting the scheduler data structures. Since the
bottom half code that schedules threaded interrupts and runs non-threaded
interrupt handlers also uses spin mutexes, spin mutexes must disable
interrupts while they are held to prevent bottom half code from deadlocking
against the top half code it is interrupting on the current CPU. Disabling
interrupts while holding a spin lock has the unfortunate side effect of
increasing interrupt latency.

To work around this, a second mutex primitive is provided that performs a
context switch when a thread blocks on a mutex. This second type of mutex is
dubbed a sleep mutex. Since a thread that contests on a sleep mutex blocks
instead of spinning, it is not susceptible to the first type of deadlock with spin
locks. Sleep mutexes cannot be used in bottom half code, so they do not need
to disable interrupts while they are held to avoid the second type of deadlock
with spin locks.
As with Solaris, when a thread blocks on a sleep mutex, it propagates its
priority to the lock owner. Therefore, if a thread blocks on a sleep mutex and
its priority is higher than the thread that currently owns the sleep mutex, the
current owner will inherit the priority of the first thread. If the owner of the
sleep mutex is blocked on another mutex, then the entire chain of threads will
be traversed bumping the priority of any threads if needed until a runnable
thread is found. This is to deal with the problem of priority inversion where a
lower priority thread blocks a higher priority thread. By bumping the priority of
the lower priority thread until it releases the lock the higher priority thread is
blocked on, the kernel guarantees that the higher priority thread will get to run
as soon as its priority allows.
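
The chain traversal described above can be sketched as a small toy model. This is only an illustration under stated assumptions: the `Thread` and `SleepMutex` types and the `propagate_priority` function are hypothetical names, not the kernel's structures, and a larger number is assumed to mean a higher priority.

```cpp
#include <cassert>

struct SleepMutex;

// Hypothetical toy types, not the kernel's own data structures.
struct Thread {
    int priority;            // assumption: larger number = higher priority
    SleepMutex* blocked_on;  // mutex this thread waits for, nullptr if runnable
};

struct SleepMutex {
    Thread* owner;           // current owner, nullptr if the mutex is free
};

// Called when `waiter` blocks on `m`: walk the chain of owners, lending the
// waiter's priority to every lower-priority owner, until a runnable owner
// (one that is not itself blocked) is reached.
void propagate_priority(Thread& waiter, SleepMutex& m) {
    waiter.blocked_on = &m;
    Thread* owner = m.owner;
    while (owner != nullptr) {
        if (owner->priority < waiter.priority)
            owner->priority = waiter.priority;   // bump only if needed
        if (owner->blocked_on == nullptr)
            break;                               // runnable thread found
        owner = owner->blocked_on->owner;        // follow the chain
    }
}

// Demo: low owns mutex a; mid owns mutex b and is blocked on a;
// high then blocks on b, so both mid and low should inherit priority 9.
int demo_priority_chain() {
    Thread low{1, nullptr}, mid{5, nullptr}, high{9, nullptr};
    SleepMutex a{&low}, b{&mid};
    mid.blocked_on = &a;
    propagate_priority(high, b);
    assert(mid.priority == 9);
    return low.priority;
}
```

The sketch omits restoring the original priorities when a lock is released, which a real implementation would also have to handle.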

These two types of mutexes are similar to the Solaris spin and adaptive
mutexes. One difference from the Solaris API is that acquiring and releasing a
spin mutex uses different functions than acquiring and releasing a sleep mutex.
A difference with the Solaris implementation is that sleep mutexes are not
adaptive. Details of the Solaris mutex API and implementation can be found in
section 3.5 of [Mauro01].

Condition Variables
Condition variables provide a logical abstraction for blocking a thread while
waiting for a condition. Condition variables do not contain the actual condition
to test, instead, one locks the appropriate mutex, tests the condition, and then
blocks on the condition variable if the condition is not true. To prevent lost
wakeups, the mutex is passed in as an interlock when waiting on a condition.
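
The lock, test, block pattern can be illustrated in user space with the C++ standard library, whose `std::condition_variable` also takes the mutex as an interlock. This is a sketch of the pattern only, not of the kernel API; the `Mailbox` type and the `produce`, `consume`, and `demo_mailbox` functions are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// The condition (ready) lives outside the condition variable itself.
struct Mailbox {
    std::mutex mtx;
    std::condition_variable cv;
    bool ready = false;
    int value = 0;
};

int consume(Mailbox& box) {
    std::unique_lock<std::mutex> lock(box.mtx);  // lock the mutex
    while (!box.ready)                           // test the condition
        box.cv.wait(lock);  // atomically unlock and block; relocks on wakeup
    return box.value;
}

void produce(Mailbox& box, int v) {
    {
        std::lock_guard<std::mutex> lock(box.mtx);
        box.value = v;
        box.ready = true;   // change the condition while holding the mutex
    }
    box.cv.notify_one();
}

int demo_mailbox() {
    Mailbox box;
    int received = 0;
    std::thread consumer([&] { received = consume(box); });
    produce(box, 42);
    consumer.join();
    assert(box.ready);
    return received;
}
```

Because the condition is changed only while the mutex is held, and the waiter releases the mutex and blocks in one atomic step, the wakeup cannot slip in between the test and the wait.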

FreeBSD's condition variables use an API quite similar to that provided in
Solaris. The only differences are the lack of a cv_wait_sig_swap and the
addition of the cv_init constructor and the cv_destroy destructor. The
implementation also differs from Solaris in that the sleep queue is embedded
in the condition variable itself instead of coming from the hashed pool of sleep
queues used by sleep and wakeup.

Shared/Exclusive Locks
Shared/Exclusive locks, also known as sx locks, provide simple reader/writer
locks. As the name suggests, multiple threads may hold a shared lock
simultaneously, but only one thread may hold an exclusive lock. Also, if one
thread holds an exclusive lock, no threads may hold a shared lock.
FreeBSD's sx locks have some limitations not present in other reader/writer
lock implementations. First, a thread may not recursively acquire an exclusive
lock. Secondly, sx locks do not implement any sort of priority propagation.
Finally, although upgrades and downgrades of locks are implemented, they
may not block. Instead, if an upgrade cannot succeed, it returns failure, and
the programmer is required to explicitly drop its shared lock and acquire an
exclusive lock. This design was intentional to prevent programmers from
making false assumptions about a blocking upgrade function. Specifically, a
blocking upgrade must potentially release its shared lock. Also, another thread
may obtain an exclusive lock before a thread trying to perform an upgrade, for
example, if two threads are performing an upgrade on a lock at the same time.
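
To make the non-blocking upgrade concrete, here is a minimal user-space sketch of a reader/writer lock with the same shape. It is not FreeBSD's implementation: the `SxLock` class, its method names, and `update_if_needed` are hypothetical, and the lock is built on a standard mutex and condition variable purely for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

class SxLock {
public:
    void slock() {                       // acquire shared
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return !exclusive_; });
        ++sharers_;
    }
    void sunlock() {                     // release shared
        std::lock_guard<std::mutex> g(m_);
        assert(sharers_ > 0);
        if (--sharers_ == 0) cv_.notify_all();
    }
    void xlock() {                       // acquire exclusive
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return !exclusive_ && sharers_ == 0; });
        exclusive_ = true;
    }
    void xunlock() {                     // release exclusive
        std::lock_guard<std::mutex> g(m_);
        assert(exclusive_);
        exclusive_ = false;
        cv_.notify_all();
    }
    // Never blocks: succeeds only if the caller is the sole shared holder.
    bool try_upgrade() {
        std::lock_guard<std::mutex> g(m_);
        if (sharers_ != 1) return false;
        sharers_ = 0;
        exclusive_ = true;
        return true;
    }
    void downgrade() {                   // exclusive -> shared, never blocks
        std::lock_guard<std::mutex> g(m_);
        assert(exclusive_);
        exclusive_ = false;
        sharers_ = 1;
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int sharers_ = 0;
    bool exclusive_ = false;
};

// Caller pattern: on a failed upgrade, drop the shared lock, take the
// exclusive lock, and re-test the state, since another thread may have
// held the exclusive lock in between.
void update_if_needed(SxLock& lock, int& value) {
    lock.slock();
    if (value != 0) { lock.sunlock(); return; }
    if (!lock.try_upgrade()) {
        lock.sunlock();
        lock.xlock();
    }
    if (value == 0) value = 1;           // re-check under the exclusive lock
    lock.xunlock();
}
```

The re-check in `update_if_needed` is the point of the design: because the upgrade never blocks, the programmer sees explicitly where the shared lock was given up.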

Semaphores

A semaphore is used as a synchronization tool. A semaphore S is an integer
variable that, apart from initialization, is accessed only through two standard
atomic operations: wait and signal, traditionally written P (for wait) and
V (for signal).

The classical definition of wait in pseudocode is

Wait(S) {
    while (S <= 0)
        ;    // no-op
    S--;
}

and the definition of signal is

Signal(S) {
    S++;
}

When one process modifies the semaphore value, no other process can
simultaneously modify that same semaphore value.
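
The pseudocode above busy-waits. As a hedged sketch, the same wait/signal semantics can be written so that a waiter blocks instead of spinning, using a standard mutex and condition variable to make each operation indivisible. The `Semaphore` class and its `value` accessor are hypothetical names for this example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

class Semaphore {
public:
    explicit Semaphore(int initial) : count_(initial) { assert(initial >= 0); }

    // wait (P): block until the count is positive, then decrement it.
    void wait() {
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return count_ > 0; });
        --count_;
    }

    // signal (V): increment the count and wake one waiter, if any.
    void signal() {
        {
            std::lock_guard<std::mutex> g(m_);
            ++count_;
        }
        cv_.notify_one();
    }

    // For inspection in examples only; the value may be stale on return.
    int value() {
        std::lock_guard<std::mutex> g(m_);
        return count_;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};
```

The mutex guarantees that the test of the count and its decrement happen as one step, which is exactly the atomicity the classical definition requires.
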
Atomic Primitives
Atomic primitives are arguably the most important tool in programming that
requires coordination between multiple threads and/or processors/cores.
There are four basic types of atomic primitives: swap, fetch and phi, compare
and swap, and load linked/store conditional. Usually these operations are
performed on values the same size as or smaller than a word (the size of the
processor's registers), though sometimes consecutive (single address)
multiword versions are provided.

The most primitive (pun not intended) is the swap operation. This operation
exchanges a value in memory with a value in a register, atomically setting the
value in memory and returning the original value in memory. This is not
actually very useful, in terms of multiprogramming, with only a single practical
use: the construction of test and set (TAS; or test and test and set - TATAS)
spinlocks. In these spinlocks, each thread attempting to enter the spinlock
spins, exchanging 1 into the spinlock value in each iteration. If 0 is returned,
the spinlock was free, and the thread now owns the spinlock; if 1 is returned,
the spinlock was already owned by another thread, and the thread attempting
to enter the spinlock must keep spinning. Ultimately, the owning thread leaves
the spinlock by setting the spinlock value to 0.

C/C++ - Code

// ATOMIC marks a function whose body the hardware performs as one indivisible step.
ATOMIC word swap(volatile word &value, word newValue)
{
    word oldValue = value;
    value = newValue;

    return oldValue;
}

void EnterSpinlock(volatile word &value)
{
    // Test and test and set spinlock
    while (value == 1 || swap(value, 1) == 1) {}
}

void LeaveSpinlock(volatile word &value)
{
    value = 0;
}

This makes for an extremely crude spinlock. The fact that there is only a single
value all threads share means that spinning can create a lot of cache coherency
traffic, as all spinning processors will be writing to the same address,
continually invalidating each other's caches. Furthermore, this mechanism
precludes any kind of order preservation, as it's impossible to distinguish when
a given thread began waiting; threads may enter the spinlock in any order,
regardless of how long any thread has been waiting to enter the spinlock.

Next up the power scale is the fetch and phi family. All members of this family
follow the same basic process: a value is atomically read from memory,
modified by the processor, and written back, with the original value returned
to the program. The modification performed can be almost anything; one of
the most useful modifications is the add operation (in this case it's called fetch
and add). The fetch and add operation is notably more useful than the swap
operation, but is still less than ideal; in addition to test and set spinlocks, fetch
and add can be used to create thread-safe counters, and spinlocks that both
preserve order and (potentially) greatly reduce cache coherency traffic.

ATOMIC word fetch_and_add(volatile word &value, word addend)
{
    word oldValue = value;
    value += addend;

    return oldValue;
}

void EnterSpinlock(volatile word &seq, volatile word &cur)
{
    word mySeq = fetch_and_add(seq, 1);
    while (cur != mySeq) {}
}

void LeaveSpinlock(volatile word &cur)
{
    cur++;
}

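
The thread-safe counter mentioned above can be sketched with the C++ standard library, where `std::atomic<T>::fetch_add` provides the fetch and add operation. The function name `count_in_parallel` is hypothetical and exists only for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each thread adds 1 to a shared counter the given number of times.
// fetch_add makes every read-modify-write indivisible, so no updates are lost.
long count_in_parallel(int threads, int increments_per_thread) {
    assert(threads > 0 && increments_per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Relaxed ordering is enough here because only the final total matters and it is read after the threads have been joined.
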
Next is the almost ubiquitous compare and swap (CAS) operation. In this
operation, a value is read from memory, and if it matches a comparison value
in the processor, a third value in the processor is atomically written in the
memory location, and ultimately the original value in memory is returned to
the program. This operation is very popular because it allows you to perform
almost any operation on a single memory location atomically, by reading the
value, modifying it, and then compare-and-swapping it back to memory. For
this reason, it is considered to be a universal atomic primitive.

ATOMIC word compare_and_swap(volatile word &value, word newValue, word comparand)
{
    word oldValue = value;
    if (oldValue == comparand)
        value = newValue;

    return oldValue;
}

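
The read, modify, compare-and-swap pattern can be sketched with `std::atomic<T>::compare_exchange_weak` from the C++ standard library. Note that its argument order (expected value first, new value second) differs from the listing above, and that on failure it writes the current value back into its first argument. The operation built here, an atomic maximum named `fetch_max`, is just one illustrative choice.

```cpp
#include <atomic>
#include <cassert>

// Atomically replace target with max(target, candidate).
// Returns the value that was in target before the call took effect.
int fetch_max(std::atomic<int>& target, int candidate) {
    int observed = target.load();                 // read the value
    while (observed < candidate &&                // modification needed?
           !target.compare_exchange_weak(observed, candidate)) {
        // CAS failed: another thread changed target (or the weak form failed
        // spuriously). observed now holds the current value, so simply retry.
    }
    return observed;
}
```

Any other single-location update can be written the same way by changing how the new value is computed from `observed`.
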
Some architectures (such as the x86) support double-width compare and swap
(atomic compare and swap of two consecutive words). While this is
convenient, it should not be relied upon in portable software, as many
architectures do not support it. Note that double-width compare and swap is
NOT the same as double compare and swap (which we'll look at soon).

Another universal atomic primitive is the load linked/store conditional (LL/SC)
pair of instructions. In this model, when a value is loaded linked into a register,
a reservation is placed on that memory address. If that reservation still exists
when the store conditional operation is performed, the store will be
performed. If another thread writes to the memory location between a load
linked and store conditional, the reservation is cleared, and any other threads
will fail their store conditional (resulting in skipping the store altogether). If the
store conditional fails, the program must loop back and try the operation
again, beginning with the load linked.
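
Portable C++ has no direct LL/SC operation, so the following is only a toy, single-threaded model of the reservation idea, with every name (`Cell`, `Reservation`, `load_linked`, `store_conditional`, `plain_store`, `llsc_increment`) invented for the illustration. It models an idealized LL/SC in which any intervening write clears the reservation; real hardware may behave differently.

```cpp
#include <cassert>

// A memory cell whose version number is bumped by every store.
struct Cell {
    int value = 0;
    unsigned version = 0;
};

// What a processor remembers after a load linked.
struct Reservation {
    const Cell* cell = nullptr;
    unsigned version = 0;
};

int load_linked(const Cell& c, Reservation& r) {
    r.cell = &c;
    r.version = c.version;   // place the reservation
    return c.value;
}

// An ordinary store by anyone; it invalidates outstanding reservations.
void plain_store(Cell& c, int v) {
    c.value = v;
    ++c.version;
}

// Succeeds only if no store hit the cell since the matching load linked.
bool store_conditional(Cell& c, const Reservation& r, int v) {
    if (r.cell != &c || r.version != c.version)
        return false;        // reservation lost: skip the store
    plain_store(c, v);
    return true;
}

// The usual retry loop: start again from the load linked on failure.
int llsc_increment(Cell& c) {
    for (;;) {
        Reservation r;
        int old = load_linked(c, r);
        if (store_conditional(c, r, old + 1))
            return old;
    }
}
```

In this model a store of the same value still clears the reservation, which is the behaviour that distinguishes idealized LL/SC from compare and swap.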

In theory, the load linked/store conditional primitive is superior to the
compare and swap operation. As any access to the memory address will clear
the reservations of other processors, the total number of reads is 1, in contrast
to the 2 reads with compare and swap (1 read to get the value initially, and a
second read to verify the value is unchanged). Furthermore, the LL/SC
operation may distinguish whether a write has occurred, regardless of whether
the value was changed by the write (we'll come back to why this is a good
thing when we discuss hazard pointers). Unfortunately, my research indicates
that most implementations of LL/SC do not guarantee that a write will be
distinguished if the written value is the same as the value at the time of the
reserve.

Finally, the wild card: the double compare and swap (DCAS). In a double
compare and swap, the compare and swap operation is performed
simultaneously on the values in two memory addresses. Obviously this
provides dramatically more power than any of the previous operations, which
operate only on single addresses. Unfortunately, support for this primitive is
extremely rare in real-world processors, and it is typically only used by lock-
free algorithm designers that are unable to reduce their algorithm to single-
address compare and swap operations.

ATOMIC void double_compare_and_swap(volatile word &value1, word newValue1,
    word comparand1, word &oldValue1, volatile word &value2, word newValue2,
    word comparand2, word &oldValue2)
{
    oldValue1 = value1;
    oldValue2 = value2;

    if (oldValue1 == comparand1 && oldValue2 == comparand2)
    {
        value1 = newValue1;
        value2 = newValue2;
    }
}

Lastly, note that I'm using rather loose classification of these primitives.
Technically, you could place the swap (fetch and store) operations in the fetch
and phi family, as well, but it seemed more intuitive to me to separate them.

Ticket Lock
A ticket lock is a form of lockless inter-thread synchronization.

Overview

Conventionally, inter-thread synchronization is achieved by using
synchronization entities provided by the operating system, such as events,
semaphores and mutexes. For example, a thread will create a mutex and in the
act of creation, "claim" the mutex. A mutex can have only one owner at any
time. Other threads, when they come to the point where their behaviour must
be synchronized, will attempt to "claim" the mutex; if they cannot, because
another thread already owns the mutex, they automatically sleep until the
thread which currently owns the mutex gives it up. Then one of the currently
sleeping threads will automatically be awoken and given ownership of the
mutex.

A ticket lock works as follows: there are two integer values which begin at 0.
The first value is the queue ticket, the second is the dequeue ticket.

When a thread arrives, it atomically obtains and then increments the queue
ticket. It then atomically compares its ticket with the dequeue ticket. If they
are the same, the thread is permitted to enter the serialised code. If they are
not the same, then another thread must already be in the serialised code and
this thread must busy-wait or yield. When a thread leaves the
serialised code, it atomically increments the dequeue ticket, thus permitting
the next waiting thread to enter the serialised code.
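
The two-ticket scheme above can be sketched with `std::atomic` from the C++ standard library. The class name `TicketLock`, its member names, and the helper `ticket_lock_demo` are assumptions made for this example; waiting threads yield rather than spin tightly.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class TicketLock {
public:
    void lock() {
        // Atomically obtain a ticket and increment the queue ticket.
        unsigned my_ticket = queue_ticket_.fetch_add(1, std::memory_order_relaxed);
        // Wait until the dequeue ticket reaches our ticket.
        while (dequeue_ticket_.load(std::memory_order_acquire) != my_ticket)
            std::this_thread::yield();
    }
    void unlock() {
        // Admit the next ticket holder.
        dequeue_ticket_.fetch_add(1, std::memory_order_release);
    }
private:
    std::atomic<unsigned> queue_ticket_{0};
    std::atomic<unsigned> dequeue_ticket_{0};
};

// Demo: several threads increment a plain (non-atomic) counter under the lock.
long ticket_lock_demo(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    TicketLock lock;
    long shared = 0;   // protected only by the ticket lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++shared;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}
```

Because tickets are handed out in arrival order, threads enter the serialised code first come, first served.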

The conventional approach, built on operating-system synchronization entities,
is heavyweight (high-overhead, code intensive), in that such
entities have a significant impact upon the performance of the operating
system, since the operating system has to switch into a special mode when
dealing with operations upon synchronization entities to ensure they provide
synchronized behaviour.

A further drawback is that if the thread which owns the mutex fails, the entire
application halts. (This type of problem applies to all synchronization entities).

Lockless locking
Main article: Lock-free and wait-free algorithms

Lockless locking achieves inter-thread synchronization without the use of
operating system provided synchronization entities. Generally, two techniques
are involved.

The first technique is the use of a special set of instructions which are
guaranteed atomic by the CPU. This generally centers on an instruction known
as compare-and-swap. This instruction compares two values and, if they are
the same, replaces one of them with a third value.

For example: compare_and_swap(destination, exchange, comparand);

Here the destination and comparand are compared. If they are identical, the
value in exchange is placed in destination.

This first technique provides the vital inter-thread atomicity of operation. It
would otherwise be impossible to perform any lockless locking, since on
systems with multiple processors, even operations such as a simple assignment
(let a = b) could be half-way through occurring while other operations occur on
other processors; the software trying to synchronize across processors could
never be sure that any operation had reached a sane state permitting it to be
used in some way.

The second technique is to busy-wait or yield when it is clear that the current
thread cannot be permitted to continue processing. This provides the vital
ability to defer the processing done by a thread when such processing would
violate the operations which are being serialised between threads.
Test & Test & Set Lock
In computer science, the test-and-set CPU instruction is used to
implement mutual exclusion in multiprocessor environments. Although a
correct lock can be implemented with test-and-set, it can lead to memory
contention when the lock is busy (caused by bus locking and cache invalidation
when the test-and-set operation needs to access memory atomically).

To lower the overhead, a more elaborate locking protocol, test and
test-and-set, is used. The main idea is not to spin in test-and-set but to
increase the likelihood of a successful test-and-set by using the following
entry protocol to the lock:

boolean locked := false    // shared lock variable

procedure EnterCritical() {
    do {
        while (locked == true) skip    // spin until lock seems free
    } while TestAndSet(locked)         // actual atomic locking
}

Exit protocol is:

procedure ExitCritical() {
    locked := false
}

The entry protocol uses normal memory reads to spin, waiting for the lock to
become free. Test-and-set is only used to try to get the lock when a normal
memory read says it's free. Thus the expensive atomic memory operations
happen less often than in a simple spin around test-and-set.

If the programming language used supports short-circuit evaluation, the entry
protocol could be implemented as:

procedure EnterCritical() {
    while (locked == true or TestAndSet(locked) == true)
        skip    // spin until locked
}
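
The entry and exit protocols above can be sketched in C++ with `std::atomic<bool>`, using `exchange` in the role of TestAndSet. The class name `TatasLock` and the helper `tatas_demo` are hypothetical; the memory orderings are one reasonable choice, not the only one.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class TatasLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on ordinary reads until the lock seems free.
            while (locked_.load(std::memory_order_relaxed))
                std::this_thread::yield();
            // Test-and-set: exchange returns the previous value, so false
            // means this thread is the one that took the lock.
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
        }
    }
    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
private:
    std::atomic<bool> locked_{false};
};

// Demo: several threads increment a plain counter under the lock.
long tatas_demo(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    TatasLock lock;
    long shared = 0;   // protected only by the lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++shared;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}
```

The inner read-only loop is what keeps the expensive atomic exchange off the bus while the lock is held by someone else.
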
Array Based Locks
In array-based locks each lock is implemented as an array of size p where p is
the total number of processors. The lock acquire involves atomically copying
and incrementing a shared index variable and spinning on the array location
indexed by the copied value. Releasing a lock involves resetting the
corresponding array location and setting the next array location if a processor
is waiting. Each array location should be allocated on a different cache line to
avoid invalidations due to false-sharing. Assume a large enough cache line size
(say, 256 bytes) and put that amount of padding between consecutive lock
locations. Use LL/SC for implementing any atomic code section that may be a
part of lock acquire or release. The algorithm should be written in C with calls
to atomic primitives implemented with LL/SC in MIPS assembly.
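This solves the O(P^2) traffic problem: because each waiting processor spins on
its own array location, a release invalidates only the next waiter's cache line
rather than those of all P waiters.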
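
The acquire and release steps of an array-based lock can be sketched as follows. This sketch uses `std::atomic` from the C++ standard library rather than LL/SC in MIPS assembly, and it assumes that no more threads contend for the lock than there are array slots. The names `PaddedFlag`, `ArrayLock`, and `array_lock_demo` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// One spin location per slot, padded to an assumed 256-byte cache line.
struct PaddedFlag {
    std::atomic<bool> must_wait{true};
    char padding[256 - sizeof(std::atomic<bool>)];
};

class ArrayLock {
public:
    explicit ArrayLock(unsigned slots) : slots_(slots), flags_(slots) {
        assert(slots > 0);
        flags_[0].must_wait.store(false);   // first arrival enters at once
    }
    // Returns the slot index, which the caller must pass to unlock().
    unsigned lock() {
        // Atomically copy and increment the shared index.
        unsigned my_slot = next_.fetch_add(1) % slots_;
        while (flags_[my_slot].must_wait.load())
            std::this_thread::yield();      // spin on our own location only
        flags_[my_slot].must_wait.store(true);   // reset for later reuse
        return my_slot;
    }
    void unlock(unsigned my_slot) {
        // Set the next location so that its waiter, if any, may proceed.
        flags_[(my_slot + 1) % slots_].must_wait.store(false);
    }
private:
    unsigned slots_;
    std::vector<PaddedFlag> flags_;
    std::atomic<unsigned> next_{0};  // wraparound of this index is ignored here
};

long array_lock_demo(unsigned threads, int iterations) {
    ArrayLock lock(threads);         // one slot per contending thread
    long shared = 0;                 // protected only by the lock
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                unsigned slot = lock.lock();
                ++shared;
                lock.unlock(slot);
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}
```

Here the reset of a slot is done by its holder right after acquiring, which is a common variant of the release-time reset described above.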

Barriers
Race conditions can cause a program to execute in a non-deterministic fashion,
producing inconsistent results. Synchronization routines are used to remove
race conditions from code. In certain cases all threads must have executed a
portion of code before continuing. Barrier synchronization is a technique to
achieve this. A thread executing an episode of a barrier waits for all other
threads before proceeding to the next. Therefore, when a barrier is reached,
all threads are forced to wait for the last thread to arrive. Use of barriers is
common in shared memory parallel programming.

Some terminology used in the context of this report:

1. Threads – Units of execution initiated by the program using OpenMP, which
might run on different processors but on the same parallel machine in which
they were initiated.
2. Parallel Machine – A collection of processors which are tightly coupled and
use shared memory. Jedi1 is considered a parallel machine. It has 4 processors.
3. Processor – A subunit of a parallel machine.

The two types of barrier are:

1. Centralized Barrier
2. Tree Barrier

Centralized Barrier

Each of the N threads entering the barrier atomically decrements a shared
integer, the initial value of which is N. If the value of the integer after the
decrement is 0, then the thread resets the counter and changes a shared
central flag. Otherwise, the thread waits for notification. In other words, each
arriving processor decrements “count” and then waits until “sense” has a
different value than it had in the previous barrier. The last arriving processor
resets “count” and reverses “sense”. The algorithm relies on every thread
reading and writing a single memory location: the counter. This memory
location is known as a hot spot.
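
The count and sense scheme just described can be sketched with `std::atomic`. The class name `CentralBarrier`, the per-thread `local_sense` argument, and the demo function are assumptions made for the example; the default sequentially consistent ordering is used throughout for simplicity.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class CentralBarrier {
public:
    explicit CentralBarrier(int n) : n_(n), count_(n), sense_(false) {
        assert(n > 0);
    }
    // local_sense is private to each thread and must start as false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;          // the value to wait for this time
        if (count_.fetch_sub(1) == 1) {      // last thread to arrive
            count_.store(n_);                // reset the counter first
            sense_.store(local_sense);       // then reverse the shared sense
        } else {
            while (sense_.load() != local_sense)
                std::this_thread::yield();   // wait for notification
        }
    }
private:
    const int n_;
    std::atomic<int> count_;
    std::atomic<bool> sense_;
};

// Demo: each round, every thread publishes the round number, crosses the
// barrier, and checks that its neighbour has published the same round.
// Returns the number of mismatches seen.
int central_barrier_demo(int threads, int rounds) {
    CentralBarrier barrier(threads);
    std::vector<int> slot(threads, -1);
    std::atomic<int> mismatches{0};
    std::vector<std::thread> pool;
    for (int id = 0; id < threads; ++id) {
        pool.emplace_back([&, id] {
            bool local_sense = false;
            for (int r = 0; r < rounds; ++r) {
                slot[id] = r;
                barrier.wait(local_sense);
                if (slot[(id + 1) % threads] != r) ++mismatches;
                barrier.wait(local_sense);   // nobody overwrites until all have read
            }
        });
    }
    for (auto& th : pool) th.join();
    return mismatches.load();
}
```

Resetting the counter before reversing the sense matters: a released thread may re-enter the barrier immediately and must find the counter already back at N.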

Tree Barrier

- Does not need a lock, only uses flags
- Arrange the processors logically in a binary tree (higher degree also possible)
- Two siblings tell each other of arrival via simple flags (i.e. one waits on a flag
  while the other sets it on arrival)
- One of them moves up the tree to participate in the next level of the barrier
- Introduces concurrency in the barrier algorithm since independent subtrees
  can proceed in parallel
- Takes log(P) steps to complete the acquire
- A fixed processor starts a downward pass of release, waking up other
  processors that in turn set other flags
- Shows much better scalability compared to centralized barriers in DSM
  multiprocessors; the advantage in small bus-based systems is not much, since
  all transactions are serialized on the bus anyway; in fact the additional log(P)
  delay may hurt performance in bus-based SMPs

[Figure: binary barrier tree signal]

[Figure: binary barrier tree wakeup]

- Each processor will need at most log(P) + 1 flags
- Avoid false sharing: allocate each processor’s flags on a separate chunk of
  cache lines
- With some memory wastage (possibly worth it), allocate each processor’s flags
  on a separate page and map that page locally in that processor’s physical
  memory
- This avoids remote misses in DSM multiprocessors
- It does not matter in bus-based SMPs
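
A flag-only tree barrier can be sketched as below. This is a simplified static-tree variant, not the sibling-pairing scheme itself: each thread owns one node of a binary tree laid out in an array, waits for its children, signals its parent, and is later released from the root downward. It uses increasing episode numbers instead of boolean flags to make reuse safe, and all names (`TreeNode`, `TreeBarrier`, `tree_barrier_demo`) are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct TreeNode {
    std::atomic<int> arrived{0};    // last episode this subtree has arrived at
    std::atomic<int> released{0};   // last episode this node was released from
    char padding[64];               // keep nodes apart to limit false sharing
};

class TreeBarrier {
public:
    explicit TreeBarrier(int n) : n_(n), nodes_(n) { assert(n > 0); }

    // id is the caller's index (0..n-1); episode is per-thread, starting at 0.
    void wait(int id, int& episode) {
        ++episode;
        const int left = 2 * id + 1, right = 2 * id + 2;
        // Arrival: wait for both subtrees, then tell the parent.
        if (left < n_) spin_until(nodes_[left].arrived, episode);
        if (right < n_) spin_until(nodes_[right].arrived, episode);
        if (id != 0) {
            nodes_[id].arrived.store(episode);
            spin_until(nodes_[id].released, episode);
        }
        // Release: thread 0 (the root) starts the downward wakeup pass.
        if (left < n_) nodes_[left].released.store(episode);
        if (right < n_) nodes_[right].released.store(episode);
    }
private:
    static void spin_until(const std::atomic<int>& flag, int episode) {
        while (flag.load() < episode) std::this_thread::yield();
    }
    int n_;
    std::vector<TreeNode> nodes_;
};

// Demo with the same publish / check-neighbour pattern; returns mismatches.
int tree_barrier_demo(int threads, int rounds) {
    TreeBarrier barrier(threads);
    std::vector<int> slot(threads, -1);
    std::atomic<int> mismatches{0};
    std::vector<std::thread> pool;
    for (int id = 0; id < threads; ++id) {
        pool.emplace_back([&, id] {
            int episode = 0;
            for (int r = 0; r < rounds; ++r) {
                slot[id] = r;
                barrier.wait(id, episode);
                if (slot[(id + 1) % threads] != r) ++mismatches;
                barrier.wait(id, episode);
            }
        });
    }
    for (auto& th : pool) th.join();
    return mismatches.load();
}
```

Each thread spins only on flags of its own node or its children, so there is no single hot spot shared by all threads.
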
What is Sense Reversal?
• Problem: reinitialization
—each time a barrier is used, it must be reset
• Difficulty: odd and even barriers can overlap in time
—some processes may still be exiting the kth barrier
—other processes may be entering the k+1st barrier
—how can one reinitialize?
• Solution: sense reversal
—terminal state of one phase is initial state of next phase
—e.g.
– odd barriers wait for flag to transition from true to false
– even barriers wait for flag to transition from false to true
• Benefits
—reduce number of references to shared variables
—eliminate one of the spinning episodes
