MODULE 5
PARALLEL MODELS, LANGUAGES, AND COMPILERS
Parallel programming models
Programming Model: A collection of program abstractions providing a simplified and transparent view of
the computer hardware/software system to the programmer.
1. Shared-Variable Model
Resources in Programming Systems
o ACTIVE RESOURCES – processors
o PASSIVE RESOURCES – memory and I/O devices
Processes
o Processes are the basic computational units in a parallel program
o A program is a collection of processes
o Parallelism depends on the implementation of IPC (inter-process communication)
Fundamental issues in parallel programming
o Specification, Creation, Suspension, Reactivation, Migration, Termination and
Synchronization of concurrent processes
o Process address space may be shared or restricted by limiting scope and access rights
Multiprocessor programming is based on the use of shared variables in a common memory for IPC.
Main Issues:
Protected access of critical sections (a lock sketch follows this list)
Memory Consistency
Atomicity of Memory Operations
Fast synchronization
Shared data structures
Fast data movement techniques
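For the first issue above, protected access of critical sections, a minimal sketch using POSIX threads is given here; the variable and function names are illustrative, not from the text:

    #include <pthread.h>
    #include <stdio.h>

    static long shared_counter = 0;                 /* shared variable in common memory */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);              /* enter critical section  */
            shared_counter++;                       /* protected shared update */
            pthread_mutex_unlock(&lock);            /* leave critical section  */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("final value: %ld\n", shared_counter);   /* 400000 with proper locking */
        return 0;
    }

Without the mutex, concurrent updates of the shared variable could be lost, which is exactly the protection problem listed above.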
Scheduling of divided program modules on parallel processors is much more complicated than
scheduling of sequential programs on a uniprocessor.
Static Scheduling: conducted at post-compile time. Its advantage is low overhead, but the
shortcoming is potentially poor resource utilization.
Dynamic Scheduling: catches the run-time conditions and thus achieves better resource
utilization, but at the expense of higher scheduling overhead; it also requires fast context
switching, preemption, and much more OS support.
Cache Coherence and Protection
3. Data-Parallel Model
SIMD Lockstep operations
Parallelism explicitly handled by hardware synchronization and flow control
Synchronization done at compile time rather than at run time
Data parallel programs require the use of pre-distributed data sets
Choice of parallel data structure is an important consideration
Emphasis on local computation and data routing operations
Can be applied to fine-grain problems
Implemented either on SIMD or SPMD
Leads to high degree of parallelism involving thousands of data operations concurrently
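As an illustration of the SPMD style with pre-distributed data sets, a minimal C sketch follows; the worker count, data distribution, and names are illustrative assumptions, not from the text:

    #include <stdio.h>

    #define N 16
    #define P 4                                     /* number of workers (SPMD copies) */

    /* Each worker executes the same code on its own pre-distributed block. */
    void spmd_body(int my_id, const double a[], const double b[], double c[])
    {
        int chunk = N / P;
        int lo = my_id * chunk, hi = lo + chunk;
        for (int i = lo; i < hi; i++)
            c[i] = a[i] + b[i];                     /* local computation only */
    }

    int main(void)
    {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
        for (int id = 0; id < P; id++)              /* stand-in for P parallel processes */
            spmd_body(id, a, b, c);
        printf("c[5] = %g\n", c[5]);
        return 0;
    }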
Data Parallelism
SIMD Characterization for data parallelism
Scalar operations and scalar data operands
Chang and Smith (1990) classified language features for parallel programming into 6
classes according to functionality:
Optimization features
Availability features
Synchronization/communication features
Control of parallelism
Data parallelism features
Process management features
Availability features
Purpose:
Enhance user-friendliness
Make the language portable to a large class of parallel computers
Expand applicability of software libraries
Scalability: in terms of no. of processors available
Independence: From hardware topology
Compatibility: With an established sequential language
Portability: across shared-memory and message-passing computers
Data parallelism features
Purpose: To specify how data are accessed and distributed in SIMD or MIMD computers
Runtime automatic Decomposition: Express C
Mapping Specification: DINO
Virtual Processor Support :PISCES 2 and DINO
Direct Access to Shared Data: Linda
An array section is specified in each dimension by a subscript triplet of the form e1 : e2 : e3,
where each ei is an arithmetic expression that must produce a scalar integer value.
The first expression e1 is a lower bound, the second e2 an upper bound, and the third e3 an
increment (stride).
For example, B(1 : 4 : 3, 6 : 8 : 2, 3) represents the four elements B(1, 6, 3), B(4, 6, 3),
B(1, 8, 3), and B(4, 8, 3) of a three-dimensional array.
When the third expression in a triplet is missing, a unit stride is assumed.
The * notation in the second expression indicates all elements in that dimension starting
from e1, or the entire dimension if e1 is also omitted.
When both e2 and e3 are omitted, e1 alone represents a single element in that
dimension. For example, A(5) represents the single element with index 5 of array A.
Array assignments are permitted under the following constraints:
The array expression on the right must have the same shape and the same number of
elements as the array on the left.
For example, the assignment A(2 : 4, 5 : 8) = A(3 : 5, 1 : 4) is valid, since both
array sections have the shape 3 × 4.
When a scalar is assigned to an array, the value of the scalar is assigned to every element
of the array.
For instance, the statement B(3 : 4, 5) = 0 sets B(3, 5) and B(4, 5) to 0.
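A C sketch of the element-wise semantics of the two array assignments above; the array bounds and the use of a temporary are illustrative (Fortran 90 evaluates the entire right-hand side before modifying the left-hand side):

    #define ROWS 10
    #define COLS 10

    static double A[ROWS + 1][COLS + 1];   /* 1-based indexing, row/column 0 unused */
    static double B[ROWS + 1][COLS + 1];

    void section_examples(void)
    {
        /* B(3:4, 5) = 0 : a scalar is broadcast to every element of the section */
        for (int i = 3; i <= 4; i++)
            B[i][5] = 0.0;

        /* A(2:4, 5:8) = A(3:5, 1:4) : both sections have shape 3 x 4.
         * The right-hand side is captured first, mimicking Fortran semantics. */
        double tmp[3][4];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 4; j++)
                tmp[i][j] = A[3 + i][1 + j];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 4; j++)
                A[2 + i][5 + j] = tmp[i][j];
    }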
(Doall, Endall )
To declare parallel activities, we use the (Doall, Endall) pair.
All iterations in the Doall loop are totally independent of each other.
This implies that they can be executed in parallel if there are sufficient processors to
handle different iterations.
However, the computations within each iteration are still executed serially in program
order.
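Since Doall iterations are independent, the same effect can be sketched in C with an OpenMP parallel loop; this assumes an OpenMP-capable compiler, and the loop body and array names are illustrative:

    #include <omp.h>

    #define N 1000

    void doall_example(double a[N], const double b[N], const double c[N])
    {
        /* Each iteration is independent, so iterations may run on different
         * processors; the work inside one iteration still runs serially. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] * c[i];
    }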
(Doacross, Endacross)
When the successive iterations of a loop depend on each other, we use the (Doacross,
Endacross) pair to declare parallelism with loop-carried dependences.
Synchronizations must be performed between the iterations that depend on each other.
For example, dependence along the J-dimension exists in the following program.
We use Doacross to declare parallelism along the I-dimension, but synchronization
between iterations is required.
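A sketch of a Doacross-style loop in C, assuming an OpenMP 4.5 compiler with doacross (ordered) synchronization; the loop body and array names are illustrative. Iteration i must wait for iteration i-1 to produce a[i-1]:

    #include <omp.h>

    #define N 1000

    void doacross_example(double a[N], const double b[N])
    {
        #pragma omp parallel for ordered(1)
        for (int i = 1; i < N; i++) {
            /* work with no loop-carried dependence could run here in parallel */
            #pragma omp ordered depend(sink: i - 1)
            a[i] = a[i - 1] + b[i];        /* loop-carried (cross-iteration) dependence */
            #pragma omp ordered depend(source)
        }
    }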
(Cobegin, Coend)
All computations specified within the block could be executed in parallel.
But parallel processes may be created with a slight time difference in real
implementations.
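The effect of a (Cobegin, Coend) block can be sketched with OpenMP sections; the three task placeholders are illustrative:

    #include <omp.h>

    void cobegin_example(void)
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            { /* task_one();   */ }        /* these three blocks may execute in parallel */

            #pragma omp section
            { /* task_two();   */ }

            #pragma omp section
            { /* task_three(); */ }
        }
        /* the implicit barrier here corresponds to Coend */
    }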
Program Flow Analysis: This phase reveals the program flow patterns in order to determine
data and control dependences in the source code.
Scalar dependence analysis is extended below to structured data arrays or matrices.
Depending on the machine structure, the granularities of parallelism to be exploited are
quite different.
Thus the flow analysis is conducted at different execution levels on different parallel
computers.
The flow analysis must also reveal code/data reuse and memory-access patterns.
Program Optimization:
This refers to the transformation of user programs in order to explore the hardware
capabilities as much as possible.
In reality, most transformations are constrained by the machine architecture.
The ultimate goal of program optimization is to maximize the speed of code execution.
This involves the minimization of code length and of memory accesses and the
exploitation of parallelism in programs.
The optimization techniques include vectorization using pipelined hardware and
parallelization using multiple processors simultaneously.
Machine-dependent transformations are meant to achieve more efficient allocation of
machine resources, such as processors, memory, registers, and functional units.
Iteration Space
The n-dimensional discrete Cartesian space for n deep loops is called an iteration space.
Each iteration is represented by its coordinates in the iteration space. The following example
clarifies the concept of lexicographic order for the successive iterations in a loop nest.
Dependence Equations
Let α and β be vectors of n integer indices within the ranges of the upper and lower
bounds of the n loops.
There is a dependence from S1 to S2 if and only if there exist α and β such that α is
lexicographically less than or equal to β and the following system of dependence
equations is satisfied:
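The system of equations is not reproduced above; in the standard formulation, assuming the subscript expressions of the array reference in S1 are f_i and those in S2 are g_i for an m-dimensional array, it reads:

    f_i(\alpha) = g_i(\beta), \qquad 1 \le i \le m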
Example
The distance and direction vectors for the dependence between iterations along the three dimensions
of the array A are (1, 0, -1) and (<, =, >), respectively.
Subscript Separability
When testing multidimensional arrays, we say that a subscript position is separable if its
indices do not occur in the other subscripts.
If two different subscripts contain the same index, we say they are coupled.
If all the subscripts are separable, we may compute the direction vector for each subscript
independently and merge the direction vectors on a positional basis with full precision.
The first subscript is separable because index i does not appear in the other dimensions,
but the second and third are coupled because they both contain the index j.
The ZIV subscripts are separable because they contain no indices.
Subscript Partitioning
We need to classify all the subscripts in a pair of array references as separable or as part
of some minimal coupled group.
A coupled group is minimal if it cannot be partitioned into two nonempty subgroups with
distinct sets of indices.
Once a partition is achieved, each separable subscript and each coupled group has
completely disjoint sets of indices.
Each partition may then be tested in isolation and the resulting distance or direction
vectors merged without any loss of precision.
Test Categories
The ZIV Test
The SIV Test
The MIV Test
For a strong SIV subscript pair of the form (a*i + c1, a*i' + c2), the dependence distance is
d = (c1 - c2)/a. A dependence exists if and only if d is an integer and |d| <= U - L, where U
and L are the loop upper and lower bounds.
For dependences that do exist, the dependence direction is given by < if d > 0, = if d = 0,
and > if d < 0.
The strong SIV test is thus an exact test that can be implemented very efficiently in a few
operations.
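A C sketch of the strong SIV test for a subscript pair (a*i + c1) in S1 and (a*i' + c2) in S2 with loop bounds L..U; the function name and interface are illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    /* Returns 1 and fills *dist / *dir if a dependence exists, else 0. */
    int strong_siv_test(int a, int c1, int c2, int L, int U, int *dist, char *dir)
    {
        int num = c1 - c2;
        if (a == 0 || num % a != 0)         /* distance is not an integer            */
            return 0;
        int d = num / a;                     /* dependence distance d = (c1 - c2)/a   */
        if (abs(d) > U - L)                  /* distance exceeds the iteration range  */
            return 0;
        *dist = d;
        *dir  = (d > 0) ? '<' : (d == 0) ? '=' : '>';
        return 1;
    }

    int main(void)
    {
        int d; char dir;
        /* e.g. A(2*i + 3) written in S1, A(2*i + 1) read in S2, loop i = 1..100 */
        if (strong_siv_test(2, 3, 1, 1, 100, &d, &dir))
            printf("dependence: distance %d, direction %c\n", d, dir);
        return 0;
    }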
A bounded iteration space is shown in Fig. 10.6a. The case of a strong SIV test is shown
in Fig. 10.6b.
This corresponds to the intersection of the dependence equation with the line i = i’.
To determine whether dependences exist, we simply need to check that the resulting
value i is within the loop bounds and is either an integer or has a noninteger part equal to
1/2.
Starting at the outermost loop nest and working inward, we replace each index in a loop
upper bound with its maximum value (or minimal if it is a negative term).
We do the opposite in the lower bound, replacing each index with its minimal value (or
maximal if it is a negative term).
We evaluate the resulting expressions to calculate the minimal and maximal values for
the loop index and then repeat for the next inner loop.
This algorithm returns the maximal range for each index, all that is needed for SIV tests.
A special case of MIV subscripts, called RDIV (restricted double-index variable)
subscripts, have the form (a1*i + c1, a2*j + c2). They are similar to SIV subscripts except
that i and j are distinct indices.
The development of a parallelizing compiler is limited by the difficulty of having to deal
with many nonperfectly nested loops.
The lack of dataflow information is often the ultimate limit on automatic compilation of
parallel code.
A basic environment provides a simple program tracing facility for debugging and
performance monitoring or a graphic mechanism for specifying the task dependence
graph.
o The software and programming environments of the Cray Y-MP, Intel Paragon XP/S, and
Connection Machine CM-5 are examined below.
Cray Y-MP Software
o The Cray Y-MP ran with the Cray operating systems UNICOS or COS.
o Two Fortran compilers, CFT77 and CFT, provided automatic vectorization, as did the C and
Pascal compilers.
o Large library routines, program management utilities, debugging aids, and a Cray
assembler were included in the system software.
o Communications and applications software were also provided by Cray, by a third party,
or from the public domain.
o UNICOS for Y-MP was written in C and was a time-sharing OS extension of UNIX.
UNICOS supported optimizing, vectorizing, concurrentizing Fortran compilers and
optimizing and vectorizing Pascal and C compilers.
o COS was a multiprogramming, multiprocessing, and multitasking OS. It offered a batch
environment to the user and supported interactive jobs and data transfers through the
front-end system.
o COS programs could run with a maximum of four processors up to 16M words of
memory on the Y-MP system.
o Migration tools were available to ease conversion from COS to UNICOS.
o CFT77 was a multipass, optimizing, transportable compiler. It carried out vectorization
and scalar optimization of Fortran 77 programs.
lntel Paragon XP/S Software
o The Intel Paragon XP/S system was an extension of the Intel iPSC/860 and Delta systems.
o It was claimed to be a scalable, wormhole-routed, mesh-connected multicomputer using
distributed memory.
o The processing nodes used were 50-MHz i860 XP processors.
o Paragon ran a distributed UNIX/OS based on OSF technology and was in conformance
with POSIX, System V.3, and Berkeley 4.3BSD.
o The languages supported included C, Fortran, Ada, C++, and data-parallel Fortran.
o The integrated tool suite included FORGE and CAST parallelization tools, Intel
ProSolver parallel equation solvers, and BLAS, NAG, SEGlib, and other math libraries.
o The programming environment provided an interactive parallel debugger (IPD) with a
hardware-aided performance visualization system (PVS).
CM-5 Software
o The software layers of the Connection Machine systems are shown in Fig. 11.3.
o The operating system used was CMOST, an enhanced version of UNIX/OS which
supported time-sharing and batch processing.
o The low-level languages and libraries matched the hardware instruction-set architectures.
o CMOST provided not only standard UNIX services but also supported fast IPC
capabilities and data-parallel and other parallel programming models.
o It also extended the UNIX I/O environment to support parallel reads and writes and
managed large files on data vaults.
o The Prism programming environment was an integrated Motif-based graphical
environment.
o Users could develop, execute, debug, and analyze the performance of programs with
Prism, which could operate on terminals or workstations running the X-Window system.
o High-level languages supported on the CM-5 included CM Fortran, C*, and *Lisp.
o CM Fortran was based on standard Fortran 77 with array processing extensions of
standard Fortran 90. CM Fortran also offered several extensions beyond Fortran 90, such
as a FORALL statement and some additional intrinsic functions.
o C* was an extension of the C programming language which supported data-parallel
programming. Similarly, *Lisp was an extension of the Common Lisp programming
language for data-parallel programming.
o Both could be used to structure parallel data and to compute, communicate, and transfer
data in parallel.
o Three synchronization methods are described below based on who updates the atom and
whether sole access is granted before or after the atomic operations:
(1) Lock Synchronization: In this method, the atom is updated by the requester
process and sole access is granted before the atomic operation. For this reason,
it is also called pre-synchronization. The method can be applied for shared
read-only access.
(2) Optimistic Synchronization: This method also updates the atom by the
requester process. But sole access is granted after the atomic operation, as
described below; a compare-and-swap sketch follows this list. It is also called
post-synchronization.
o A process may secure sole access after first completing an atomic
operation on a local version of the atom and then executing another
atomic operation on the global version of the atom.
o The second atomic operation determines if a concurrent update of the
atom has been made since the first operation was begun. If a
concurrent update has taken place, the global version is not updated;
instead, the first atomic operation is aborted and restarted from the
new global version.
The name optimistic is due to the fact that the method expects that
there will be no concurrent access to the atom during the execution
of a single atomic operation.
(3) Server Synchronization: This method updates the atom by the server process
of the requesting process, as suggested by the name. Compared with lock
synchronization and optimistic synchronization, server synchronization offers
full service.
An atom has a unique update server. A process requesting an atomic
operation on the atom sends the request to the atom's update server.
The update server may be a specialized server processor (SP) associated
with the atom's memory module.
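Optimistic (post-) synchronization is commonly realized with a compare-and-swap; a minimal C11 sketch follows, where the atom is taken to be a simple shared counter (an illustrative choice, not from the text):

    #include <stdatomic.h>

    static _Atomic long atom = 0;        /* the shared "atom" */

    void optimistic_update(void)
    {
        long old_val, new_val;
        do {
            old_val = atomic_load(&atom);    /* first operation: work on a local copy */
            new_val = old_val + 1;           /* the atomic operation, done locally    */
            /* second operation: publish only if no concurrent update has happened;
             * otherwise abort and retry from the new global version.                */
        } while (!atomic_compare_exchange_weak(&atom, &old_val, new_val));
    }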
Multitasking Environments
Multitasking exploits parallelism at several levels:
- Functional units are pipelined or chained together.
- Multiple functional units are used concurrently.
- I/O and CPU activities are overlapped.
- Multiple CPUs cooperate on a single program to achieve minimal execution time.
In a multitasking environment, the tasks and data structures of a job must be properly
partitioned to allow parallel execution without conflict.
However, the availability of processors, the order of execution, and the completion of
tasks are functions of the run-time conditions of the machine. Therefore, multitasking is
nondeterministic with respect to time.
On the other hand, tasks themselves must be deterministic with respect to results.
To ensure successful multitasking, the user must precisely define and include the
necessary communication and synchronization mechanisms and provide protection for
shared data in critical sections.
Macrotasking
Figure 11.4a
A main program forks a subroutine S1 and then forks out three additional subroutines S2,
S3 and S4. Macrotasking is best suited to programs with larger, longer-running tasks.
The execution of these four subroutine calls on a uniprocessor (processor 0) is done
sequentially. Macrotasking can spread the calls across four processors.
Note that each processor may have a different execution time. However, the overall
execution time is reduced due to parallel processing.
Microtasking
This corresponds to multitasking at the loop control level with finer granularity. Compiler
directives are often used to declare parallel execution of independent or dependent
iterations of a looping program construct.
Figure 11.4b illustrates how microtasking spreads successive iterations of a Do loop across
four processors simultaneously.
When the iterations are independent of each other, microtasking is easier to implement.
When dependence does exist between the iterations, the system must resolve the
dependence before parallel execution can be continued.
Interprocessor communication is needed to resolve the dependences.
In addition to working efficiently on parts of programs where the granularity is small,
microtasking works well when the number of processors available for the job is unknown
or may vary during the program's execution.
Figure 11.4b
Autotasking
The autotasking feature automatically divides a program into discrete tasks for parallel
execution on a multiprocessor. In the past, macrotasking was achieved by direct
programmer intervention. Microtasking was aided by an interactive compiler.
Autotasking demands much higher degrees of automation. Only some Cray
multiprocessors, like the Cray Y-MP and C-90, were provided with autotasking software.
Autotasking is based on the microtasking design and shares several advantages with
microtasking: very low overhead synchronization cost, excellent dynamic performance
independent of the number of processors available, both large and small granularity
parallelism, and so on.
COMPUTER ARCHITECTURE
a) We define computer architecture as the arrangement by which the various system
building blocks (processors, functional units, main memory, cache, data paths,
and so on) are interconnected and interoperated to achieve desired system
performance.
b) Processors make up the most important part of a computer system. Therefore, in
addition to (a), processor design also constitutes a central and very important element
of computer architecture.
In Fig. 12.1 we see the schematic diagram of a simple computer system consisting of four
processors, a large shared main memory, and a processor-memory bus.
For the three subsystems, we assume the following performance figures:
i. Each of the four processors can perform double precision floating point
operations at the rate of 500 million per second.
ii. The shared main memory can read/write data at the aggregate rate of 1000
million 32-bit words per second.
iii. The processor-memory bus has the capability of transferring 500 million
32-bit words per second to/from main memory.
This system exhibits a performance mismatch between the processors, main memory, and
the processor memory bus.
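A rough calculation, under the simplifying assumption that there are no caches, illustrates the mismatch:
- Aggregate processor demand: 4 x 500 million = 2000 million double precision operations per second.
- Even if each operation fetched only one double precision operand (two 32-bit words) from memory, that is already 4000 million words per second of traffic.
- Main memory can supply at most 1000 million words per second and the bus only 500 million words per second, so the bus becomes the bottleneck and sustained performance falls to a small fraction of peak unless caches reduce the memory traffic and a faster bus is provided.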
The system architect must pay careful attention to all such potential mismatches in
system design. Otherwise, the sustained performance which the system can deliver can
only equal the performance of the slowest part of the system—i.e. the bottleneck.
In this particular case, a better investment for increased system performance could be
a) Faster processor-memory bus, and
b) Improved cache memory with each processor.
At a given time, VLSI technology limits the left-hand side in the above equation, while
the designer must select the two factors on the right. Aggressive exploitation of
instruction level parallelism, with multiple functional units and more complex control
logic, increases the chip area and transistor count per processor core.
Of course system design would involve issues which are more complex than these, but a
basic design issue is seen here:
o How should the designers divide available chip resources among processors and,
within a single processor, among its various functional elements?
o Within a processor, a set of instructions are in various stages of execution at a
given time, within the pipeline stages, functional units, operation buffers,
reservation stations, and so on. Recall that functional units themselves may also
be internally pipelined.
o To maintain the flow of instructions within the processor, a superscalar
processor makes use of branch prediction.
PROBLEM DEFINITION
We must address the problem of detecting and exploiting the parallelism which is implicit
within the instruction stream.
We need a prototype instruction for our processor. We assume that the processor has a
load-store type of instruction set, which means that all arithmetic and logical operations
are carried out on operands which are present in programmable registers.
Operands are transferred between main memory and registers by load and store
instructions only.
We assume a three-address instruction format, as seen on most RISC processors, so that a
typical instruction for an arithmetic or logical operation has the format: opcode Rdst, Rsrc1,
Rsrc2, i.e. a destination register and two source registers.
Data transfer instructions have only two operands—source and destination registers; load
and store instructions to/from main memory specify one operand in the form of a memory
address, using an available addressing mode.
Effective address for load and store is calculated at the time of instruction execution.
Conditional branch instructions need to be treated as a special category.
Review of dependence basic concepts
Data Dependences: Assume that instruction Ik follows instruction Ij in the program. Data
dependence between Ij and Ik means that both access a common operand. Four patterns of
access to the common operand are possible: both Ij and Ik read it; Ij writes it and Ik reads it;
Ij reads it and Ik writes it; or both Ij and Ik write it.
Of these, the first pattern of register access does not in fact create a dependence, since
the two instructions can read the common value of the operand in any order.
The other three patterns of operand access do create dependences amongst instructions.
Based on the order of the read and write accesses in each pattern, these are known as RAW
(read-after-write) dependence, WAR (write-after-read) dependence, and WAW
(write-after-write) dependence, respectively.
Sometimes we need to show dependences between instructions using graphical notation.
We shall use small circles to represent instructions, and double line arrows between two
circles to denote dependences.
The instruction at the head of the arrow is dependent on the instruction at the tail; if
necessary, the type of dependence between instructions may be shown by appropriate
notation next to the arrow.
A missing arrow between two instructions will mean explicit absence of dependence.
Single line arrows will be used between instructions when we wish to denote program
order without any implied dependence or absence of dependence.
Figure 12.2 illustrates this notation.
Often dependences are thus depicted in a basic block of instructions— i.e. a sequence of
instructions with entry only at the first instruction, and exit only at the last instruction of
the sequence.
o Part (a) of Fig. 12.3 shows a basic block of six instructions, denoted I1 through I6
in program order.
o Part (b) of the figure shows a possible pattern of dependences as they may exist
amongst these six instructions. For simplicity, we have not shown the type of each
dependence, e.g. RAW(R3), etc.
Control Dependences
In typical application programs, basic blocks tend to be small in length, since about 15%
to 20% of the instructions in programs are branch and jump instructions, with indirect jumps
and returns from procedure calls also included in the latter category.
Because of typically small sizes of basic blocks in programs, the amount of instruction
level parallelism which can be exploited in a single basic block is limited.
Assume that instruction Ij is a conditional branch and that whether another instruction Ik
executes or not depends on the outcome of the conditional branch instruction Ij.
In such a case, we say that there is a control dependence of instruction Ik on Ij.
Resource Dependences
This is possibly the simplest kind of dependence to understand, since it refers to a
resource constraint causing dependence amongst instructions needing the resource.
A resource dependence which results in a pipeline stall can arise for access to any
processor resource— functional unit, data path, register bank, and so on.
We can certainly say that such resource dependences will arise if hardware resources
provided on the processor do not match the needs of the executing program.
MODEL OF A TYPICAL PROCESSOR
In the process of translating a sequential source program into machine language, the
compiler performs extensive syntactic and semantic analysis of the source program.
There are several ways in which the compiler can contribute to the exploitation of
implicit instruction level parallelism.
One relatively simple technique which the compiler can employ is known as loop
unrolling, by which independent instructions from multiple successive iterations of a loop
can be made to execute in parallel.
Unrolling means that the body of the loop is repeated n times for n successive values of
the control variable, so that one iteration of the transformed loop performs the work of
n iterations of the original loop.
Example: Loop unrolling
Consider the following body of a loop in a user program, where all the variables except
the loop control variable i are assumed to be floating point:
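The original listing is not reproduced here; a representative loop body of the kind being described might be (the variable names and the bound n are illustrative):

    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i] + d[i];     /* a, b, c, d are floating point arrays */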
Now suppose that machine code is generated by the compiler as though the original
program had been written as:
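Again as an illustrative reconstruction, the 4-way unrolled form (assuming n is a multiple of 4) would be:

    for (i = 0; i < n; i += 4) {
        c[i]     = a[i]     * b[i]     + d[i];
        c[i + 1] = a[i + 1] * b[i + 1] + d[i + 1];
        c[i + 2] = a[i + 2] * b[i + 2] + d[i + 2];
        c[i + 3] = a[i + 3] * b[i + 3] + d[i + 3];
    }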
In the unrolled program fragment, the loop contains four independent instances of the
original loop body—indeed this is the meaning of loop unrolling.
Suppose machine code corresponding to the second program fragment is executing on a
processor. Then clearly, if the processor has sufficient floating point arithmetic
resources, instructions from the four loop iterations can be in progress in parallel on the
various functional units.
It is clear that code length of the machine language program increases as a result of loop
unrolling; this increase may have an effect on the cache hit ratio.
Also, more registers are needed to exploit the instruction level parallelism within the
longer unrolled loop.
OPERAND FORWARDING
Operand forwarding helps in reducing the impact of true data dependences in the
instruction stream.
Consider a simple sequence of two instructions in a running program: the result of an
ADD instruction is stored in destination register R3, and is then shifted right by four bits
by the second instruction, with the shifted value being placed in R4.
Thus, there is a simple RAW dependence between the two instructions—the output of the
first is required as input operand of the second.
In terms of our notation, this RAW dependence appears as shown in Fig. 12.6, in the
form of a graph with two nodes and one edge.
REORDER BUFFER
Since instructions execute in parallel on multiple functional units, the reorder buffer serves
the function of bringing completed instructions back into an order which is consistent with
program order.
Entries in the reorder buffer are completed instructions, which are queued in program order.
However, since instructions do not necessarily complete in program order, we also need a
flag with each reorder buffer entry to indicate whether the instruction in that position has
completed.
Figure 12.9 shows a reorder buffer of size eight. Four fields are shown with each entry in the
reorder buffer—instruction identifier, value computed, program-specified destination of the
value computed, and a flag indicating whether the instruction has completed.
In Fig. 12.9, the head of queue of instructions is shown at the top, arbitrarily labeled as
instr[i].
This is the instruction which would be committed next, if it has completed execution. When
this instruction commits, its result value is copied to its destination, and the instruction is
then removed from the reorder buffer.
The next instruction to be issued in the issue stage of the instruction pipeline then joins the
reorder buffer at its tail.
If the instruction at the head of the queue has not completed, and the reorder buffer is full,
then further issue of instructions is held up, i.e. the pipeline stalls, because there is no free
space in the reorder buffer for one more entry.
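A minimal C sketch of such a circular reorder buffer, with the four fields described above; the sizes and names are illustrative, not from the text:

    #define ROB_SIZE 8

    typedef struct {
        int    instr_id;     /* instruction identifier            */
        double value;        /* value computed by the instruction */
        int    dest_reg;     /* program-specified destination     */
        int    completed;    /* has the instruction finished?     */
    } RobEntry;

    static RobEntry rob[ROB_SIZE];
    static int head, tail, count;        /* head = oldest entry (next to commit) */

    /* Issue: allocate an entry at the tail, in program order.
     * Returns -1 when the buffer is full, which stalls the pipeline. */
    int rob_issue(int instr_id, int dest_reg)
    {
        if (count == ROB_SIZE) return -1;
        rob[tail] = (RobEntry){ instr_id, 0.0, dest_reg, 0 };
        int slot = tail;
        tail = (tail + 1) % ROB_SIZE;
        count++;
        return slot;
    }

    /* Commit: only the entry at the head may commit, and only if completed;
     * this restores program order at the point results become visible. */
    int rob_commit(double reg_file[])
    {
        if (count == 0 || !rob[head].completed) return 0;
        reg_file[rob[head].dest_reg] = rob[head].value;   /* copy result to destination */
        head = (head + 1) % ROB_SIZE;
        count--;
        return 1;
    }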
We now take a brief look at how the use of reorder buffer addresses the various types of
dependences in the program.
(i) Data Dependence: A RAW dependence—i.e. true data dependence—will hold up the
execution of the dependent instruction if the result value required as its input operand is
not available. As suggested above, operand forwarding can be added to this scheme to
speed up the supply of the needed input operand as soon as its value has been computed.
(ii) Control Dependence: Suppose the instruction(s) in the reorder buffer belong to a
branch path in the program which should not have been taken, i.e. there has been a
mis-predicted branch.
o Clearly then the reorder buffer should be flushed along with other elements of the
pipeline. Therefore the performance impact of control dependences in the running
program is determined by the accuracy of branch prediction technique employed.
o The reorder buffer plays no direct role in the handling of control dependences.
(iii) Resource Dependences: If an instruction needs a functional unit to execute, but the
unit is not free, then the instruction must wait for the unit to become free; clearly no
technique in the world can change that. In such cases, the processor designer can aim to
achieve at least this: if a subsequent instruction needs to use another functional unit
which is free, then the subsequent instruction can be executed out of order.
REGISTER RENAMING
Amongst the instructions in various stages of execution within the processor, there would
be occurrences of RAW, WAR and WAW dependences on programmable registers.
As we have seen, RAW is true data dependence—since a value written by one instruction
is used as an input operand by another. But a WAR or WAW dependence can be avoided
if we have more registers to work with.
We can simply remove such a dependence by getting the two instructions in question to
use two different registers.
But we must also assume that the instruction set architecture (ISA) of the processor is
fixed—i.e. we cannot change it to allow access to a larger number of programmable
registers.
Therefore the only way to make a larger number of registers available to instructions
under execution within the processor is to make the additional registers invisible to
machine language instructions.
Instructions under execution would use these additional registers, even if instructions
making up the machine language program stored in memory cannot refer to them.
Working of register renaming for WAW dependence
For example, let us say that an instruction FADD writes its result into register R5, and a
later instruction FSUB also writes its result into R5.
Both these instructions are writing to register R5, creating thereby a WAW dependence,
i.e. an output dependence, on register R5. Clearly, any subsequent instruction should read
the value written into R5 by FSUB, and not the value written by FADD.
Figure 12.10 shows this dependence in graphical notation.
With additional registers available for use as these instructions execute, we have a simple
technique to remove this output dependence.
Let FSUB write its output value to a register other than R5, and let us call that other
register X. Then the instructions which use the value generated by FSUB will refer to X,
while the instructions which use the value generated by FADD will continue to refer to
R5.
Now since FADD and FSUB are writing to two different registers, the output dependence
or WAW between them has been removed.
When FSUB commits, then the value in R5 should be updated by the value in X—i.e. the
value computed by FSUB. Then the physical register X, which is not a program visible
register, can be freed up for use in another such situation.
Note that here we have mapped, or renamed, R5 to X for the purpose of storing the result
of FSUB, and have thereby removed the WAW dependence from the instruction stream.
The first FADD instruction is writing a value into R2, which the second FADD
instruction is using, i.e. there is true data dependence between these two instructions.
Let us assume that, when the first FADD instruction executes, R2 is mapped into
program invisible register Xm. A later FSUB instruction writes another value into
R2.
Clearly, the second FADD (and other intervening instructions before FSUB) should see
the value in R2 which is written by the first FADD and not the value written by FSUB.
Figure 12.11 shows these two dependences in graphical notation.
Instructions which use the value written by FSUB refer to Xn, while instructions which
use the value written by the first FADD refer to Xm.
The WAR dependence between the second FADD and FSUB is removed; but the RAW
dependence between the two FADD instructions is respected via Xm.
When the first FADD commits, the value in Xm is transferred to R2 and program
invisible register Xm is freed up; likewise, later when FSUB commits, the value in Xn is
transferred to R2 and program invisible register Xn is freed up.
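A minimal C sketch of register renaming with a rename map and a free list of program invisible (physical) registers; the register counts and names are illustrative:

    #include <stdio.h>

    #define NUM_ARCH 8      /* programmer-visible registers R0..R7   */
    #define NUM_PHYS 16     /* physical (program invisible) registers */

    static int rename_map[NUM_ARCH];          /* arch reg -> current phys reg */
    static int free_list[NUM_PHYS], free_top; /* stack of free physical regs  */

    void rename_init(void)
    {
        free_top = 0;
        for (int a = 0; a < NUM_ARCH; a++)
            rename_map[a] = a;                     /* identity mapping at start */
        for (int p = NUM_PHYS - 1; p >= NUM_ARCH; p--)
            free_list[free_top++] = p;             /* remaining regs are free   */
    }

    /* A source operand reads whatever physical register the architectural
     * register currently maps to; this preserves RAW dependences. */
    int rename_src(int arch) { return rename_map[arch]; }

    /* A destination operand gets a fresh physical register; this removes
     * WAR and WAW dependences on the architectural register. */
    int rename_dst(int arch)
    {
        int phys = free_list[--free_top];
        rename_map[arch] = phys;
        return phys;
    }

    int main(void)
    {
        rename_init();
        /* FADD ..., R5   then   FSUB ..., R5   (WAW on R5) */
        int p1 = rename_dst(5);     /* FADD writes physical p1                */
        int p2 = rename_dst(5);     /* FSUB writes physical p2 (the "X" above) */
        printf("FADD -> P%d, FSUB -> P%d; later readers of R5 use P%d\n",
               p1, p2, rename_src(5));
        return 0;
    }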
TOMASULO’S ALGORITHM
The processor in this system (the IBM System/360 Model 91) was designed with multiple
floating point units, and it made use of an innovative algorithm for the efficient use of
these units.
The algorithm was based on operand forwarding over a common data bus, with tags to
identify the sources of data values sent over the bus.
The algorithm has since become known as Tomasulo's algorithm, after the name of its
chief designer; what we now understand as register renaming was also an implicit part of
the original algorithm.
Recall that, for register renaming, we need a set of program invisible registers to which
programmable registers are re-mapped.
Tomasulo's algorithm requires these program invisible registers to be provided with
reservation stations of functional units.
Figure 12.12 shows such a functional unit connected to the common data bus, with three
reservation stations provided on it.
When the needed operand value or values are available in a reservation station, the
functional unit can initiate the required operation in the next clock cycle.
At the time of instruction issue, the reservation station is filled out with the operation
code (op).
If an operand value is available, for example in a programmable register, it is transferred
to the corresponding source operand field in the reservation station.
However, if the operand value is not available at the time of issue, the corresponding
source tag (t1 and/or t2) is copied into the reservation station.
The source tag identifies the source of the required operand. As soon as the required
operand value is available at its source—which would be typically the output of a
functional unit—the data value is forwarded over the common data bus, along with the
source tag.
This value is copied into all the reservation station operand slots which have the matching
tag.
Thus operand forwarding is achieved here with the use of tags. All the destinations which
require a data value receive it in the same clock cycle over the common data bus, by
matching their stored operand tags with the source tag sent out over the bus.
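A small C sketch of tag-based operand forwarding into reservation stations, in the spirit of the scheme described above; the field names and tag convention are illustrative assumptions:

    #define EMPTY_TAG 0              /* tag 0 means "value already present" */

    typedef struct {
        int    busy;                 /* entry in use?                        */
        int    op;                   /* operation code                       */
        int    tag1, tag2;           /* source tags of pending operands      */
        double val1, val2;           /* operand values, once available       */
    } ResStation;

    /* Broadcast a (tag, value) pair on the common data bus: every waiting
     * reservation-station slot with a matching tag captures the value in
     * the same cycle. */
    void cdb_broadcast(ResStation rs[], int n, int tag, double value)
    {
        for (int i = 0; i < n; i++) {
            if (!rs[i].busy) continue;
            if (rs[i].tag1 == tag) { rs[i].val1 = value; rs[i].tag1 = EMPTY_TAG; }
            if (rs[i].tag2 == tag) { rs[i].val2 = value; rs[i].tag2 = EMPTY_TAG; }
        }
    }

    /* An entry may start execution when both operands are present. */
    int rs_ready(const ResStation *r)
    {
        return r->busy && r->tag1 == EMPTY_TAG && r->tag2 == EMPTY_TAG;
    }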
Example Tomasulo's algorithm and RAW dependence
Assume that instruction I1 is to write its result into R4, and that two subsequent instructions
I2 and I3 are to read—i.e. make use of—that result value.
Thus instructions I2 and I3 are truly data dependent (RAW dependence) on instruction I1.
See Fig. 12.13.
Assume that the value in R4 is not available when I2 and I3 are issued; the reason could
be, for example, that one of the operands needed for I1 is itself not available. Thus we
assume that I1 has not even started executing when I2 and I3 are issued.
When I2 and I3 are issued, they are parked in the reservation stations of the appropriate
functional units.
Since the required result value from I1 is not available, these reservation station entries of
I2 and I3 get the source tag corresponding to the output of I1, i.e. the output of the
functional unit which is performing the operation of I1.
When the result of I1 becomes available at its functional unit, it is sent over the common
data bus along with the tag value of its source—i.e. output of functional unit.
At this point, programmable register R4 as well as the reservation stations assigned to I2
and I3 have the matching source tag—since they are waiting for the result value, which is
being computed by I1.
When the tag sent over the common data bus matches the tag in any destination, the data
value on the bus is copied from the bus into the destination.
The copy occurs at the same time into all the destinations which require that data value.
Thus R4 as well as the two reservation stations holding I2 and I3 receive the required
data value, which has been computed by I1, at the same time over the common data bus.
Thus, through the use of source tags and the common data bus, in one clock cycle, three
destinations receive the value produced by I1: programmable register R4, and
the operand registers in the reservation stations assigned to I2 and I3.
BRANCH PREDICTION
A basic branch prediction technique uses a so-called two bit predictor. A two-bit counter
is maintained for every conditional branch instruction in the program.
The two-bit counter has four possible states; these four states and the possible transitions
between these states are shown in Fig. 12.15.
When the counter state is 0 or 1, the respective branch is predicted as taken; when the
counter state is 2 or 3, the branch is predicted as not taken.
When the conditional branch instruction is executed and the actual branch outcome is
known, the state of the respective two-bit counter is changed as shown in the figure using
solid and broken line arrows.
When two successive predictions come out wrong, the prediction is changed from branch
taken to branch not taken, and vice versa.
In Fig. 12.15, state transitions made on mis-predictions are shown using broken line
arrows, while solid line arrows show state transitions made on predictions which come
out right.
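A C sketch of a two-bit predictor consistent with the description above (states 0 and 1 predict taken, states 2 and 3 predict not taken); the table size and indexing by the branch address are illustrative assumptions:

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE 1024
    static uint8_t counter[TABLE_SIZE];      /* one 2-bit counter per branch entry */

    int predict_taken(uint32_t pc)
    {
        return counter[pc % TABLE_SIZE] <= 1;        /* states 0, 1 -> predict taken */
    }

    void update_predictor(uint32_t pc, int was_taken)
    {
        uint8_t *c = &counter[pc % TABLE_SIZE];
        if (was_taken) {
            if (*c > 0) (*c)--;        /* move toward the "taken" states     */
        } else {
            if (*c < 3) (*c)++;        /* move toward the "not taken" states */
        }
    }

    int main(void)
    {
        uint32_t pc = 0x400123;                       /* illustrative branch address */
        update_predictor(pc, 1); update_predictor(pc, 1);
        printf("prediction: %s\n", predict_taken(pc) ? "taken" : "not taken");
        return 0;
    }

Because the counter saturates, two successive mispredictions are needed before the prediction flips, matching the behaviour described above.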