Advanced Computer Architecture 15CS72

MODULE 5
PARALLEL MODELS, LANGUAGES, AND COMPILERS
Parallel programming models
Programming Model: A collection of program abstractions providing a simplified and transparent view of
the computer hardware/software system to the programmer.

Specifically designed for:


 Multiprocessor, Multicomputer and Vector/SIMD Computer
Basic Models
 Shared-Variable Model
 Message-Passing Model
 Data-Parallel Model
 Object-Oriented Model
 Functional and Logic Models

1. Shared-Variable Model
Resources in Programming Systems
o ACTIVE RESOURCES – processors
o PASSIVE RESOURCES – memory and I/O devices
Processes
o A process is the basic computational unit in a parallel program
o A program is a collection of processes
o Parallelism depends on implementation of IPC (inter-process communication)
Fundamental issues in parallel programming
o Specification, Creation, Suspension, Reactivation, Migration, Termination and
Synchronization of concurrent processes
o Process address space may be shared or restricted by limiting scope and access rights

Multiprocessor programming is based on the use of shared variables in a common memory for IPC.
Main Issues:
 Protected access of critical sections
 Memory Consistency
 Atomicity of Memory Operations
 Fast synchronization
 Shared data structures
 Fast data movement techniques

Critical Section (CS)


 Critical section (CS) is a code segment accessing shared variables, which must be executed by
only one process at a time and which, once started, must be completed without interruption.
 Requirements
 Mutual Exclusion: at most one process executing the CS at a time.
 No Deadlock in waiting
 Non-preemption
 Eventual Entry
Protected Access
 The main problem associated with the use of a CS is avoiding race conditions, where concurrent processes executing in different orders produce different results.
 If the CS is large, it may limit parallelism due to excessive waiting by competing processes.
 If the CS is too small, it may add unnecessary code complexity or software overhead.
 Binary and counting semaphores are used to implement CSs and to avoid system deadlocks (a sketch appears below).
 Operational Modes used in programming multiprocessor systems
 Multiprogramming
 Multiprocessing
 Multitasking
 Multithreading
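As an illustration of protected access to a critical section, the sketch below uses a POSIX mutex, which plays the role of a binary semaphore (the shared counter, the thread count and the function names are illustrative assumptions, not part of the original notes):

    #include <pthread.h>
    #include <stdio.h>

    static long shared_counter = 0;                  /* shared writable object */
    static pthread_mutex_t cs_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&cs_lock);            /* enter the critical section */
            shared_counter++;                        /* keep the CS as small as possible */
            pthread_mutex_unlock(&cs_lock);          /* leave the critical section */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", shared_counter);   /* always 400000 with the lock */
        return 0;
    }

The mutex guarantees mutual exclusion; non-preemption within the CS and eventual entry then depend on the fairness of the underlying lock implementation.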
Partitioning and Replication
Program Partitioning: is a technique for decomposing a large program and data set into many small
pieces for parallel execution by multiple processors.
Program Replication: Program replication refers to duplication of the same program code for parallel
execution on multiple processors over different data sets.

 Partitioning is often practiced on a shared-memory multiprocessor system,
 while replication is more suitable for distributed-memory message-passing multicomputers.
Scheduling and Synchronization

 Scheduling of divided program modules on parallel processors is much more complicated than
scheduling of sequential programs on a uniprocessor.
 Static Scheduling: conducted at post-compilation time. Its advantage is low overhead, but its shortcoming is potentially poor resource utilization.
 Dynamic Scheduling: adapts to the run-time conditions.
 However, dynamic scheduling requires fast context switching, preemption, and much
more OS support.
 The advantages of dynamic scheduling include better resource utilization at the expense
of higher scheduling overhead.
Cache Coherence and Protection

 Multiprocessor Cache Coherence: multiprocessors must ensure data consistency between private caches and the shared memory.
 The multicache coherence problem demands an invalidation or update after each write operation.
 Sequential Consistency Model: demands that all memory accesses be strongly ordered on a
global basis.
 Weak Consistency Model: enforces ordering and coherence at explicit synchronization points
only.

2. Message Passing Model

 Messages may be: Instructions, Data, Synchronization Signals or Interrupt Signals
 Communication delays caused by message passing are greater than those caused by accessing shared variables
 Message passing Models
 Synchronous Message Passing: blocking communication
 Asynchronous Message Passing: non blocking
 Critical issue in programming this model
 How to distribute or duplicate program codes and data sets over processing nodes?
 Trade-offs between computation time and communication overhead to be considered

 Synchronous Message Passing


 Sender and receiver processes are synchronized in time and space
 No need of mutual exclusion
 No buffers required
 Blocking communication scheme
 Uniform Communication Delays, in general
 Asynchronous Message Passing
 Sender and receiver processes do not need to be synchronized in time or space
 Often uses buffers in channels
 Non-blocking communication scheme
 Non-uniform or arbitrary communication delays
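The contrast between the two schemes can be made concrete with message-passing library calls. The fragment below uses MPI purely as an illustration (MPI is not mentioned in the original notes; buffer size and tags are arbitrary): MPI_Send/MPI_Recv return only when the buffer can safely be reused or the data has arrived, whereas MPI_Isend returns immediately and completion is checked later.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)          /* run with at least two ranks */
    {
        int rank;
        double buf[4] = {1.0, 2.0, 3.0, 4.0};
        MPI_Status  status;
        MPI_Request request;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Blocking (synchronous-style) exchange */
        if (rank == 0)
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);

        /* Non-blocking (asynchronous) send: the call returns at once and the
           sender may overlap computation before waiting for completion.      */
        if (rank == 0) {
            MPI_Isend(buf, 4, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &request);
            /* ... useful computation overlapped with communication ... */
            MPI_Wait(&request, &status);     /* buffer may be reused after this */
        } else if (rank == 1) {
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        if (rank == 1) printf("received %.1f ... %.1f\n", buf[0], buf[3]);
        return 0;
    }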

3. Data-Parallel Model
 SIMD Lockstep operations
 Parallelism explicitly handled by hardware synchronization and flow control
 Synchronization done at compile time rather than at run time
 Data parallel programs require the use of pre-distributed data sets
 Choice of parallel data structure is an important consideration
 Emphasis on local computation and data-routing operations
 Applicable to fine-grain problems
 Implemented either on SIMD or SPMD
 Leads to high degree of parallelism involving thousands of data operations concurrently
Data Parallelism
 SIMD Characterization for data parallelism
 Scalar operations and scalar data operands

 Vector operations and vector data operands


 Constant data operands
 Masking operations
 Data-routing operations
Array Language Extensions for data-parallel processing
 Expected Characteristics
 Global Address Space
 Explicit data routing among PEs
 The number of PEs should be a function of problem size rather than machine size
Compiler Support
 Array language expressions and their optimizing compilers must be embedded in familiar
standards such as Fortran and C with an idea to:
 Unify program execution model
 Facilitate precise control of massively parallel hardware
 Enable incremental migration to data parallel execution
4. Object-Oriented Model
 In this model, objects are dynamically created and manipulated.
 Processing is performed by sending and receiving messages among objects.
 Concurrent OOP
o Objects are program entities which encapsulate data and operations into single
computational units. It turns out that concurrency is a natural consequence of the concept
of objects.
o In fact, the concurrent use of coroutines in conventional programming is very similar to
the concurrent manipulation of objects in OOP.
 The development of concurrent object-oriented programming (COOP) provides an alternative
model for concurrent computing on multiprocessors or on multicomputers.
 Various object models differ in the internal behavior of objects and in how they interact with each
other.
Parallelism in COOP
Three common patterns of parallelism have been found in the practice of COOP.
 First, pipeline concurrency involves the overlapped enumeration of successive solutions
and concurrent testing of the solutions as they emerge from an evaluation pipeline.
 Second, divide-and-conquer concurrency involves the concurrent elaboration of different
subprograms and the combining of their solutions to produce a solution to the overall
problem.
 A third pattern is called cooperative problem solving. In this case, all objects must
interact with each other; intermediate results are stored in objects and shared by passing
messages between them.

5. Functional and Logic Models


Functional Programming Model
 A functional programming language emphasizes the functionality of a program and
should not produce side effects after execution.
 The evaluation of a function produces the same value regardless of the order in which its
arguments are evaluated.
 This implies that all arguments in a dynamically created structure of a functional program
can be evaluated in parallel.
 The functional programming models can be easily applied to data-driven multiprocessors.
Logic Programming Model
 Logic programming is suitable for knowledge processing dealing with large databases.
 This model adopts an implicit search strategy and supports parallelism in the logic
inference process.
 Clauses in logic programming can be transformed into dataflow graphs.
 Concurrent Prolog, developed by Shapiro (1986), and Parlog, introduced by Clark and Gregory in the mid-1980s, are two parallel logic programming languages.

PARALLEL LANGUAGES AND COMPILERS

 Chang and Smith (1990) classified language features for parallel programming into 6
classes according to functionality:
 Optimization features
 Availability features
 Synchronization/Communication features

 Parallelism Control features


 Data Parallelism features
 Process Management features
Optimization features

 Used for program restructuring and compilation directives in converting sequentially coded programs into parallel forms
 Purpose: to match software parallelism with hardware parallelism in target machine
 Automated parallelizer: Express C, Alliant FX Fortran compiler
 Semi-automated parallelizer: DINO
 Interactive restructure support: MIMDizer
Availability features

 Purpose:
 Enhance user-friendliness
 Make the language portable to a large class of parallel computers
 Expand applicability of software libraries
 Scalability: in terms of no. of processors available
 Independence: From hardware topology
 Compatibility: With an established sequential language
 Portability: across shared-memory and message-passing computers
Synchronization/Communication features

 Single assignment languages


 Shared variables (locks) for IPC
 Logically shared memory (e.g. tuple space in Linda)
 Send/receive primitives for message passing
 Remote procedure call
 Data flow languages
Parallelism Control features

 Coarse, Medium or Fine grain parallelism


 Explicit or Implicit parallelism
 Global parallelism
 Loop parallelism
 Task-split parallelism
 Shared task queue
 Divide-and-conquer paradigm
 Shared abstract data types
 Task dependency specification
Data Parallelism features

 Purpose: How data are accessed and distributed in SIMD or MIMD computers
 Runtime automatic Decomposition: Express C
 Mapping Specification: DINO
 Virtual Processor Support :PISCES 2 and DINO
 Direct Access to Shared Data: Linda

 SPMD: DINO and Hypertasking


Process Management features
 Purpose: Efficient creation of parallel processes
 Implement multithreading or multitasking
 Program Partitioning and Replication
 Dynamic Load balancing at runtime
 Dynamic Process creation
 LWP(lightweight processes) (threads)
 Replicated Workers
 Partitioned networks
 Automatic Load Balancing

Parallel Language Constructs


Fortran 90 Array Notation:
 A multidimensional data array is represented by an array name indexed by a sequence of
subscript triplets, one for each dimension.
 Triplets for different dimensions are separated by commas; each triplet has the general form e1 : e2 : e3.
 Each ei is an arithmetic expression that must produce a scalar integer value.
 The first expression e1 is a lower bound, the second e2 an upper bound, and the third e3 an
increment (stride).
 For example, B(1 : 4 : 3, 6 : 8 : 2, 3) represents the four elements B(1, 6, 3), B(4, 6, 3), B(1, 8, 3), and B(4, 8, 3) of a three-dimensional array.
 When the third expression in a triplet is missing, a unit stride is assumed.
 The * notation in the second expression indicates all elements in that dimension starting from e1, or the entire dimension if e1 is also omitted.
 When both e2 and e3 are omitted, the e1 alone represents a single element in that
dimension. For example, A (5) represents the fifth element in the array A(3 : 7 : 2).
 Array assignments are permitted under the following constraints:
 The array expression on the right must have the same shape and the same number of
elements as the array on the left.
 For example, the assignment A(2 : 4, 5 : 8) = A(3 : 5, 1 : 4) is valid because both sections contain 3 × 4 elements.
 When a scalar is assigned to an array, the value of the scalar is assigned to every element
of the array.
 For instance, the statement B(3 : 4, 5) = 0 sets B(3, 5) and B(4, 5) to 0.
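For readers more familiar with C, the two assignments above correspond roughly to the scalar loops below (a sketch only; Fortran arrays are 1-based and column-major, which the sketch ignores, and the array sizes are arbitrary assumptions):

    #include <stdio.h>

    /* Illustrative arrays, indexed 1-based like the Fortran examples
       (index 0 is simply left unused).                               */
    static double A[10][10], B[10][10];

    int main(void)
    {
        /* B(3:4, 5) = 0 : set B(3,5) and B(4,5) to zero */
        for (int i = 3; i <= 4; i++)
            B[i][5] = 0.0;

        /* A(2:4, 5:8) = A(3:5, 1:4) : copy a 3 x 4 section.  Fortran array
           assignment reads the whole right-hand side before writing the
           left-hand side, hence the temporary buffer.                     */
        double tmp[3][4];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 4; j++)
                tmp[i][j] = A[3 + i][1 + j];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 4; j++)
                A[2 + i][5 + j] = tmp[i][j];

        printf("B(3,5) = %.1f\n", B[3][5]);
        return 0;
    }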

Parallel Flow Control


Do, Enddo
 The conventional Fortran Do loop declares that all scalar instructions within the (Do,
Enddo) pair are executed sequentially, and so are the successive iterations.

(Doall, Endall )
 To declare parallel activities, we use the (Doall, Endall) pair.
 All iterations in the Doall loop are totally independent of each other.
 This implies that they can be executed in parallel if there are sufficient processors to
handle different iterations.
 However, the computations within each iteration are still executed serially in program order.
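A present-day analogue of the Doall construct (used here only as an illustration; OpenMP is not part of the original notes) is an OpenMP parallel loop in C, compiled with -fopenmp:

    #include <stdio.h>
    #define N 1000

    int main(void)
    {
        static double a[N], b[N], c[N];   /* zero-initialized, illustrative data */

        /* Doall-style loop: the iterations are independent of one another,
           so the runtime may execute them on as many processors as are
           available; the work inside each iteration still runs serially.   */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[0] = %.1f\n", c[0]);
        return 0;
    }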
(Doacross, Endacross)
 When the successive iterations of a loop depend on each other, we use the (Doacross,
Endacross) pair to declare parallelism with loop-carried dependences.
 Synchronizations must be performed between the iterations that depend on each other.
 For example, dependence along the J-dimension exists in the following program.
 We use Doacross to declare parallelism along the I-dimension, but synchronization
between iterations is required.

(Cobegin, Coend)
 All computations specified within the block could be executed in parallel.
 But parallel processes may be created with a slight time difference in real
implementations.

Fork and Join


 During the execution of a process P, we can use a Fork Q command to invoke a new
process Q
 The Join Q command recombines the two processes into one process.
 Execution of Q is initiated when the Fork Q statement in P is executed.
 Programs P and Q are executed concurrently until either P executes the Join Q statement
or Q terminates.
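The Fork/Join pair maps naturally onto POSIX threads, used here only as an illustration: pthread_create plays the role of Fork Q and pthread_join the role of Join Q (the printed messages are placeholders).

    #include <pthread.h>
    #include <stdio.h>

    /* Process Q of the text, expressed as a thread function. */
    static void *Q(void *arg)
    {
        printf("Q runs concurrently with P\n");
        return NULL;
    }

    int main(void)                            /* plays the role of process P */
    {
        pthread_t q;
        pthread_create(&q, NULL, Q, NULL);    /* Fork Q : Q starts executing */
        printf("P continues its own work\n"); /* P and Q now run concurrently */
        pthread_join(q, NULL);                /* Join Q : recombine into one process */
        printf("after Join, only P remains\n");
        return 0;
    }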

Optimizing Compilers for Parallelism


 The role of a compiler is to remove the burden of program optimization and code
generation from the programmer.
 A parallelizing compiler consists of the following three major phases: flow analysis,
optimizations, and code generation, as depicted in Fig. 10.4.

Program Flow Analysis: This phase reveals the program flow patterns in order to determine
data and control dependences in the source code.
 Scalar dependence analysis is extended below to structured data arrays or matrices.
 Depending on the machine structure, the granularities of parallelism to be exploited are
quite different.
 Thus the flow analysis is conducted at different execution levels on different parallel
computers.
 The flow analysis must also reveal code/data reuse and memory-access patterns.
Program Optimization:
 This refers to the transformation of user programs in order to explore the hardware
capabilities as much as possible.
 In reality, most transformations are constrained by the machine architecture.
 The ultimate goal of program optimization is to maximize the speed of code execution.
 This involves the minimization of code length and of memory accesses and the
exploitation of parallelism in programs.
 The optimization techniques include vectorization using pipelined hardware and
parallelization using multiple processors simultaneously.
 Machine-dependent transformations are meant to achieve more efficient allocation of
machine resources, such as processors, memory, registers, and functional units.

 Other optimizations include elimination of unnecessary branches or common expressions.


Instruction scheduling can be used to eliminate pipeline or memory delays in executing
consecutive instructions.
Parallel Code Generation
 Code generation usually involves transformation from one representation to another,
called an Intermediate form.
 Parallel code generation is very different for different computer classes.
 For example, a superscalar processor may be software-scheduled or hardware-scheduled.
 How to optimize the register allocation on a RISC or superscalar processor, how to
reduce the synchronization overhead when codes are partitioned for multiprocessor
execution, and how to implement message-passing commands when codes/data are
distributed (or replicated) on a multicomputer are added difficulties in parallel code
generation.
DEPENDENCE ANALYSIS OF DATA ARRAYS
Iteration Space and Dependence
 The process of computing all the data dependences in a program is called dependence
analysis.
 The testing scheme presented below is based on the dependence analysis work of Goff, Kennedy, and Tseng (1991).
Dependence Testing
 Dependence testing is the method used to determine whether dependences exist between
two subscripted references to the same array in a loop nest.
 Suppose we wish to test whether or not there exists a dependence from statement S1 to
S2 in the following model loop nest of n levels, represented by n integer indices i1, i2, ..., in.

Iteration Space
 The n-dimensional discrete Cartesian space for n deep loops is called an iteration space.
 The iteration is represented as coordinates in the iteration space. The following example
clarifies the concept of lexicographic order for the successive iterations in a loop nest.

 Consider a two-dimensional iteration space (Fig.10.5) representing the following two-


level loop nest in unit increment steps:

Dependence Equations
 Let α and β be vectors of n integer indices within the ranges of the upper and lower
bounds of the n loops.
 There is a dependence from S1 to S2 if and only if there exist α and β such that α is lexicographically less than or equal to β and the following system of dependence equations is satisfied:
fj(α) = gj(β) for j = 1, 2, ..., m
where fj and gj are the j-th subscript expressions of the two array references in S1 and S2.
 Otherwise the two references are independent.


Distance and Direction vectors
 Suppose there exists a data dependence for α = (α1, α2,….., αn) and β = (β1, β2, ..., βn).
Then the distance vector D = (D1,D2, ..., Dn) is defined as β - α .
 The direction vector d = (d1, d2, ..., dn) of the dependence is defined by: di = "<" if Di > 0, "=" if Di = 0, and ">" if Di < 0.
Example

The distance and direction vectors for the dependence between iterations along the three dimensions of the array A are (1, 0, -1) and (<, =, >), respectively.
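The example loop itself is not reproduced above; a loop nest of the following shape (a hypothetical reconstruction in C, with illustrative bounds) yields exactly these vectors, because the element written as A[i+1][j][k-1] in iteration (i, j, k) is read as A[i'][j'][k'] in iteration (i+1, j, k-1):

    #define N 8
    static double A[N + 2][N + 2][N + 2];

    void example(void)
    {
        /* distance vector (1, 0, -1), direction vector (<, =, >) */
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                for (int k = 1; k <= N; k++)
                    A[i + 1][j][k - 1] = A[i][j][k] + 1.0;
    }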

Subscript Separability and Partitioning


Subscript Categories
 When testing for dependence, we classify subscript positions by the total number of
distinct loop indices they contain.
 A subscript is said to be zero index variable (ZIV) if the subscript position contains no index in either reference.
 A subscript is said to be single index variable (SIV) if only one index occurs in that
position.
 Any subscript with more than one index is said to be multiple index variable (MIV).
Example
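(The author's example is not reproduced; the following is an illustrative reconstruction.) In a reference such as A(5, I+1, I+J) inside a loop nest over I and J, the first subscript is ZIV because it contains no loop index, the second is SIV because it contains only I, and the third is MIV because it contains both I and J.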

Subscript Separability

 When testing multidimensional arrays, we say that a subscript position is separable if its
indices do not occur in the other subscripts.
 If two different subscripts contain the same index, we say they are coupled.
 If all the subscripts are separable, we may compute the direction vector for each subscript
independently and merge the direction vectors on a positional basis with full precision.
 The first subscript is separable because index i does not appear in the other dimensions,
but the second and third are coupled because they both contain the index j.
 The ZIV subscripts are separable because they contain no indices.
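As a hypothetical illustration (the author's example is not reproduced): for the pair of references A(I+1, J, J-1, 3) and A(I, J+2, J, 3), the first subscript position contains only I and is separable, the second and third positions both contain J and are therefore coupled, and the fourth position is ZIV and separable.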

Subscript Partitioning

 We need to classify all the subscripts in a pair of array references as separable or as part
of some minimal coupled group.
 A coupled group is minimal if it cannot be partitioned into two nonempty subgroups with
distinct sets of indices.
 Once a partition is achieved, each separable subscript and each coupled group has
completely disjoint sets of indices.
 Each partition may then be tested in isolation and the resulting distance or direction
vectors merged without any loss of precision.

Categorized Dependence Tests


 The goal of dependence testing is to construct the complete set of distance and direction
vectors representing potential dependences between an arbitrary pair of subscripted
references to the same array variable.
The Testing Algorithm
(1) Partition the subscripts into separable and minimal coupled groups using the following algorithm:

(2) Label each subscript as ZIV, SIV, or MIV.


(3) For each separable subscript, apply the appropriate single subscript test (ZIV, SIV, MIV)
based on the complexity of the subscript. This will produce independence or direction vectors for
the indices occurring in that subscript.
(4) For each coupled group, apply a multiple subscript test to produce a set of direction vectors
for the indices occurring within that group.
(5) If any test yields independence, no dependences exist.
(6) Otherwise merge all the direction vectors computed in the previous steps into a single set of
direction vectors for the two references.

Test Categories
 The ZIV Test
 The SIV Test
 The MIV Test

The ZIV test


 The ZIV test is a dependence test performed on two loop-invariant expressions.
 If the system determines that the two expressions cannot be equal, it has proved
independence.
 Otherwise the subscript does not contribute any direction vectors and may be ignored.

The SIV Test


 An SIV subscript for index i is said to be strong if it has the form (ai + c1, ai' + c2).
 For strong SIV subscripts, the dependence distance d = i' - i follows from the dependence equation:
ai + c1 = ai' + c2
a(i' - i) = c1 - c2
d = (c1 - c2) / a

 A dependence exists if and only if d is an integer and |d| ≤ U - L, where U and L are the loop upper and lower bounds.
 For dependences that do exist, the dependence direction is "<" if d > 0, "=" if d = 0, and ">" if d < 0.
 The strong SIV test is thus an exact test that can be implemented very efficiently in a few
operations.
 A bounded iteration space is shown in Fig. 10.6a. The case of a strong SIV test is shown
in Fig. 10.6b.
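A minimal sketch of the strong SIV test in C (the function name and the way the bounds are passed are assumptions; the logic follows the equations above):

    #include <stdio.h>
    #include <stdlib.h>

    /* Strong SIV test for the subscript pair (a*i + c1, a*i' + c2).
       Reports a dependence and returns the distance d = (c1 - c2)/a
       only when d is an integer and |d| <= U - L.                    */
    int strong_siv_test(long a, long c1, long c2, long L, long U, long *dist)
    {
        long diff = c1 - c2;
        if (a == 0 || diff % a != 0)   /* d is not an integer: independent */
            return 0;
        long d = diff / a;
        if (labs(d) > U - L)           /* distance lies outside the loop bounds */
            return 0;
        *dist = d;                     /* direction is <, =, or > as d >0, =0, <0 */
        return 1;
    }

    int main(void)
    {
        long d;
        /* Example: subscripts (2*i + 3, 2*i' + 1), loop bounds 1..100 */
        if (strong_siv_test(2, 3, 1, 1, 100, &d))
            printf("dependence with distance %ld\n", d);   /* prints distance 1 */
        else
            printf("independent\n");
        return 0;
    }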

A weak SIV Test


 A weak SIV subscript has the form (a1i + c1, a2i’ + c2), where the coefficients of the two
occurrences of index i have different constant values.
 Two special cases should be studied separately.
 Weak-Zero SIV test
 Weak-Crossing SlV test
Weak zero SIV test
 The case in which a1= 0 or a2 = 0 is called a weak-zero SIV subscript, as illustrated in
Fig. 10.6c.
 If a2 = 0, the dependence equation reduces to i = (c2 - c1) / a1.
 A similar check applies when a1 = 0.


 The weak-zero SIV test finds dependences caused by a particular iteration i.
 It is only necessary to check that the resulting value for i is an integer and within the loop
bounds.

Weak-Crossing SIV test


 All subscripts where a2 = -a1 are called weak-crossing SIV, as illustrated in Fig. 10.6d.
 In this case we set i = i' and derive the dependence equation i = (c2 - c1) / (2a1).
 This corresponds to the intersection of the dependence equation with the line i = i’.
 To determine whether dependences exist, we simply need to check that the resulting
value i is within the loop bounds and is either an integer or has a noninteger part equal to
1/2.

The MIV Test


 SIV tests can be extended to handle complex iteration spaces where loop bounds may be
functions of other loop indices, e.g. triangular or trapezoidal loops.
 We need to compute the minimum and maximum loop bounds for each loop index.

 Starting at the outermost loop nest and working inward, we replace each index in a loop
upper bound with its maximum value (or minimal if it is a negative term).
 We do the opposite in the lower bound, replacing each index with its minimal value (or maximal if it is a negative term).
 We evaluate the resulting expressions to calculate the minimal and maximal values for
the loop index and then repeat for the next inner loop.
 This algorithm returns the maximal range for each index, all that is needed for SIV tests.
 A special case of MIV subscripts, called RDIV (restricted double-index variable) subscripts, has the form (a1i + c1, a2j + c2). They are similar to SIV subscripts except that i and j are distinct indices.
 The development of a parallelizing compiler is limited by the difficulty of having to deal
with many nonperfectly nested loops.
 The lack of dataflow information is often the ultimate limit on automatic compilation of
parallel code.

Chapter 11: Parallel Program Development and Environments

Parallel Programming Environments


 An environment for parallel programming consists of hardware platforms, languages
supported, OS and software tools, and application packages.
 The hardware platforms vary from shared-memory, message-passing, vector processing,
and SIMD to dataflow computers.
Software Tools and Environments
 Figure 11.1 shows a classification of environment types on the line between the minimal
languages and integrated environments.
 Integrated environments can be divided into basic, limited, and well-developed classes,
depending on the maturity of the tool sets.

 A basic environment provides a simple program tracing facility for debugging and
performance monitoring or a graphic mechanism for specifying the task dependence
graph.

 Limited integration provides tools for parallel debugging, performance monitoring, or


program visualization beyond the capability of the basic environments listed.
 Well-developed environments provide intensive tools for debugging programs,
interaction of textual/graphical representations of a parallel program, visualization
support for performance monitoring, program visualization, parallel I/O, parallel
graphics, etc.
Environment Features
 Design Issues
• Compatibility
• Expressiveness
• Ease of use
• Efficiency
• Portability
 Parallel languages may be developed as extensions to existing sequential languages
 A new parallel programming language has the advantage of:
o Using high-level parallel concepts or constructs for parallelism instead of
imperative (algorithmic) languages which are inherently sequential
 Special optimizing compilers detect parallelism and transform sequential constructs into parallel ones.
 High Level parallel constructs can be:
• implicitly embedded in syntax, or
• explicitly specified by users
o Compiler Approaches as shown in figure 11.2
• Pre-processors:
o use compiler directives or macroprocessors
• Pre-compilers:
o include automated and semi-automated parallelizers
• Parallelizing Compilers

Y-MP, Paragon and CM-5 Environments

o The software and programming environments of the Cray Y-MP, Intel Paragon XP/S, and
Connection Machine CM-5 are examined below.
Cray Y-MP Software
o The Cray Y-MP ran with the Cray operating systems UNICOS or COS.
o Two Fortran compilers, CFT77 and CFT, provided automatic vectorization, as did the C and Pascal compilers.
o Large library routines, program management utilities, debugging aids, and a Cray
assembler were included in the system software.
o Communications and applications software were also provided by Cray, by a third party,
or from the public domain.
o UNICOS for Y-MP was written in C and was a time-sharing OS extension of UNIX. UNICOS supported optimizing, vectorizing, concurrentizing Fortran compilers and optimizing and vectorizing Pascal and C compilers.
o COS was a multiprogramming, multiprocessing, and multitasking OS. It offered a batch
environment to the user and supported interactive jobs and data transfers through the
front-end system.
o COS programs could run with a maximum of four processors and up to 16M words of memory on the Y-MP system.
o Migration tools were available to ease conversion from COS to UNICOS.
o CFT77 was a multipass, optimizing, transportable compiler. It carried out vectorization
and scalar optimization of Fortran 77 programs.
lntel Paragon XP/S Software
o The Intel Paragon XP/S system was an extension of the Intel iPSC/860 and Delta systems.
o It was claimed to be a scalable, wormhole-routed, mesh-connected multicomputer using distributed memory.
o The processing nodes used were 50-MHz i860 XP processors.
o Paragon ran a distributed UNIX/OS based on OSF technology and was in conformance
with POSIX, System V.3, and Berkeley 4.3BSD.
o The languages supported included C, Fortran, Ada, C++, and data-parallel Fortran.
o The integrated tool suite included FORGE and CAST parallelization tools, Intel
ProSolver parallel equation solvers, and BLAS, NAG, SEGlib, and other math libraries.
o The programming environment provided an interactive parallel debugger (IPD) with a
hardware-aided performance visualization system (PVS).
CM-5 Software
o The software layers of the Connection Machine systems are shown in Fig. 11.3.
o The operating system used was CMOST, an enhanced version of UNIX/OS which
supported time-sharing and batch processing.
o The low-level languages and libraries matched the hardware instruction-set architectures.
o CMOST provided not only standard UNIX services but also supported fast IPC
capabilities and dataparallel and other parallel programming models.
o It also extended the UNIX I/O environment to support parallel reads and writes and
managed large files on data vaults.
o The Prism programming environment was an integrated Motif-based graphical
environment.
o Users could develop, execute, debug, and analyze the performance of programs with
Prism, which could operate on terminals or workstations running the X-Window system.

o High-level languages supported on the CM-5 included CM Fortran, C*, and *Lisp.
o CM Fortran was based on standard Fortran 77 with array processing extensions of
standard Fortran 90. CM Fortran also offered several extensions beyond Fortran 90, such
as a FORALL statement and some additional intrinsic functions.
o C* was an extension of the C programming language which supported data-parallel
programming. Similarly, *Lisp was an extension of the Common Lisp programming
language for data-parallel programming.
o Both could be used to structure parallel data and to compute, communicate, and transfer
data in parallel.

Visualization and Performance Tuning


 The performance of a parallel computer can be enhanced at several stages:
o At the machine design stage, the architecture and OS should be optimized to yield
high resource utilization and maximum system throughput.
o At the algorithm design/data structuring stage, the programmer should match the
software with the target hardware.
o At the compiling stage, the source code should be optimized for concurrency,
vectorization, or scalar optimization at various granularity levels.
o Finally, the compiled object code should go through further fine-tuning with the
help of program profilers or run-time information.

Synchronization and Multiprocessing Modes


Principles of Synchronization
 The performance and correctness of a parallel program execution rely heavily on efficient
synchronization among concurrent computations in multiple processors.
 The source of the synchronization problem is the sharing of writable objects among
processes.

 Once a writable object permanently becomes read-only, the synchronization problem


vanishes at that point.
 Synchronization consists of implementing the order of operations in an algorithm by
observing the dependences for writable data.
 We examine below
o Atomic operations, wait protocols, fairness policies, access order, and sole-access
protocol, based on the work of Bitar (1991), for implementing efficient
synchronization schemes.
Atomic Operations
 Two classes of shared memory access operations are
(1) an individual read or write such as Register1 := x, and
(2) an indivisible read-modify-write such as x := f(x) or y := f(x).
 From the synchronization point of view, the order of program operations is described by
read-modify-write operations over shared writable objects called atoms. An operation on
an atom is called an atomic operation.
 A hard atom is one whose access races are resolved by hardware, whereas a soft atom is
one whose access races are resolved by software.
 The execution of operations may be out of program order as long as the execution order
preserves the meaning of the code.
 Three kinds of program dependences are identified below:
o Data dependences: WAR, RAW, and WAW.
o Control dependences: Program flow control statements such as goto and if-then.
o Side-effect dependences: Due to exceptions, traps, I/O accesses, time out, etc.
Wait Protocols
 There are two kinds of wait protocols
o In a busy wait, the process remains loaded in the processor's context registers and
is allowed to continually retry. While it does consume processor cycles, the
reason for using busy wait is that it offers a faster response when the shared object
becomes available.
o In a sleep wait, the process is removed from the processor and put in a wait queue.
The process being suspended must be notified of the event it is waiting for. The
system complexity increases in a multiprocessor using sleep wait as compared
with those implementing busy wait.
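The two protocols can be contrasted in C (a sketch only, using C11 atomics and POSIX threads for illustration): the busy-wait version spins on an atomic flag and keeps consuming processor cycles, while the sleep-wait version blocks the calling thread on a condition variable until it is notified of the event.

    #include <stdatomic.h>
    #include <pthread.h>

    /* Busy wait: retry continually until the flag is released. */
    static atomic_flag spin = ATOMIC_FLAG_INIT;

    void busy_wait_enter(void)
    {
        while (atomic_flag_test_and_set(&spin))
            ;                           /* keeps retrying; consumes CPU cycles */
    }
    void busy_wait_leave(void)
    {
        atomic_flag_clear(&spin);
    }

    /* Sleep wait: the waiting thread is removed from the processor and
       placed in a wait queue until another thread notifies it.          */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    static int available = 0;           /* the shared object becomes available */

    void sleep_wait(void)
    {
        pthread_mutex_lock(&m);
        while (!available)
            pthread_cond_wait(&c, &m);  /* sleeps in a wait queue */
        available = 0;                  /* take the object */
        pthread_mutex_unlock(&m);
    }

    void notify(void)
    {
        pthread_mutex_lock(&m);
        available = 1;
        pthread_cond_signal(&c);        /* revive one waiting thread */
        pthread_mutex_unlock(&m);
    }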
Fairness Policies
o For all suspended processes waiting in a queue, a fairness policy must be used to revive
one of the waiting processes.
o Three fairness policies are summarized below:
o FIFO: The wait queue follows a first-in-first-out policy.
o Bounded: The number of turns a waiting process will miss is upper-bounded.
o Livelock-free: One waiting process will always proceed; not all will wait forever.
o In general, the higher level of fairness corresponds to the more expensive
implementation. Another concern is the prevention of deadlock among competing
processes.
Sole-Access Protocols
o Conflicting atomic operations are serialized by means of sole-access protocols.

o Three synchronization methods are described below based on who updates the atom and
whether sole access is granted before or after the atomic operations:
(1) Lock Synchronization: In this method, the atom is updated by the requester
process and sole access is granted before the atomic operation. For this reason,
it is also called pre-synchronization. The method can be applied for shared
read-only access.
(2) Optimistic Synchronization: This method also updates the atom by the
requester process. But sole access is granted after the atomic operation, as
described below. It is also called post-synchronization.
o A process may secure sole access after first completing an atomic
operation on a local version of the atom and then executing another
atomic operation on the global version of the atom.
o The second atomic operation determines if a concurrent update of the
atom has been made since the first operation was begun. If a
concurrent update has taken place, the global version is not updated;
instead, the first atomic operation is aborted and restarted from the
new global version.
 The name optimistic is due to the fact that the method expects that
there will be no concurrent access to the atom during the execution
of a single atomic operation.
(3) Server Synchronization: This method updates the atom by the server process
of the requesting process, as suggested by the name. Compared with lock
synchronization and optimistic synchronization, server synchronization offers
full service.
 An atom has a unique update server. A process requesting an atomic operation on the atom sends the request to the atom's update server.
 The update server may be a specialized server processor (SP) associated with the atom's memory module.
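Optimistic (post-) synchronization is often realized today with a compare-and-swap retry loop; the sketch below uses C11 atomics purely as an illustration of scheme (2): the new value is computed from a local copy and installed only if no concurrent update has occurred, otherwise the operation restarts from the new global version.

    #include <stdatomic.h>

    static _Atomic long atom = 0;       /* the shared atom */

    /* Optimistic update by the requester process: f is applied to a local
       copy, and the result is installed only if the global version has not
       changed in the meantime; otherwise the atomic operation is retried.   */
    void optimistic_update(long (*f)(long))
    {
        long old = atomic_load(&atom);
        for (;;) {
            long new_val = f(old);                                   /* local version */
            if (atomic_compare_exchange_weak(&atom, &old, new_val))  /* global version */
                return;     /* no concurrent update was made: done */
            /* the exchange failed: 'old' now holds the new global value; retry */
        }
    }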

Multiprocessor Execution Modes


 Multiprocessing modes include parallel execution from the fine-grain process level to the
medium-grain task level, and to the coarse-grain program level.
 We examine the programming requirements for intrinsic multiprocessing as well as for
multitasking.
Multiprocessing Requirements
 Multiprocessing at the process level requires the use of shared memory in a tightly
coupled system.
 Summarized below are special requirements for facilitating efficient multiprocessing:
- Fast context switching among multiple processes resident in processors.
- Multiple register sets to facilitate context switching.
- Fast memory access with conflict-free memory allocation.
- Effective synchronization mechanism among multiple processors.
- Software tools for achieving parallel processing and performance monitoring.
- System and application software for interactive users.

Multitasking Environments
 Multitasking exploits parallelism at several levels:
- Functional units are pipelined or chained together.
- Multiple functional units are used concurrently.
- I/O and CPU activities are overlapped.
- Multiple CPUs cooperate on a single program to achieve minimal execution time.
 In a multitasking environment, the tasks and data structures of a job must be properly
partitioned to allow parallel execution without conflict.
 However, the availability of processors, the order of execution, and the completion of
tasks are functions of the run-time conditions of the machine. Therefore, multitasking is
nondeterministic with respect to time.
 On the other hand, tasks themselves must be deterministic with respect to results.
 To ensure successful multitasking, the user must precisely define and include the
necessary communication and synchronization mechanisms and provide protection for
shared data in critical sections.

Multitasking on Cray Multiprocessor


 Three levels of multitasking are described below for parallel execution on Cray X-MP or Y-MP multiprocessors.
Macrotasking
 When multitasking is conducted at the level of subroutine calls, it is called macrotasking, with medium to coarse grains. Macrotasking has been implemented ever since the introduction of Cray X-MP systems. The concept of macrotasking is depicted in Fig. 11.4a.

Figure 11.4a
 A main program forks a subroutine S1 and then forks out three additional subroutines S2, S3 and S4. Macrotasking is best suited to programs with larger, longer-running tasks.
 The execution of these four subroutine calls on a uniprocessor (processor 0) is done sequentially. Macrotasking can spread the calls across four processors.
 Note that each processor may have a different execution time. However, the overall
execution time is reduced due to parallel processing.
Microtasking

 This corresponds to multitasking at the loop control level with finer granularity. Compiler
directives are often used to declare parallel execution of independent or dependent
iterations of a looping program construct.
 Figure 11.4b illustrates the spreading of successive iterations of a Do loop across four processors through microtasking.
 When the iterations are independent of each other, microtasking is easier to implement.
When dependence does exist between the iterations, the system must resolve the
dependence before parallel execution can be continued.
 Interprocessor communication is needed to resolve the dependences.
 In addition to working efficiently on parts of programs where the granularity is small,
microtasking works well when the number of processors available for the job is unknown
or may vary during the program's execution.

Figure 11.4b
Autotasking
 The autotasking feature automatically divides a program into discrete tasks for parallel
execution on a multiprocessor. In the past, macrotasking was achieved by direct
programmer intervention. Microtasking was aided by an interactive compiler.
 Autotasking demands much higher degrees of automation. Only some Cray
multiprocessors, like the Cray Y-MP and C-90, were provided with autotasking software.
 Autotasking is based on the microtasking design and shares several advantages with
microtasking: very low overhead synchronization cost, excellent dynamic performance
independent of the number of processors available, both large and small granularity
parallelism, and so on.

Chapter 12: Instruction Level Parallelism

COMPUTER ARCHITECTURE
a) We define computer architecture as the arrangement by which the various system building blocks (processors, functional units, main memory, cache, data paths, and so on) are interconnected and interoperated to achieve the desired system performance.
b) Processors make up the most important part of a computer system. Therefore, in
addition to (a), processor design also constitutes a central and very important element

of computer architecture. Various functional elements of a processor must be


designed, interconnected and inter-operated to achieve the desired processor performance.
 System performance is the key benchmark in the study of computer architecture.
 A basic rule of system design is that there should be no performance bottlenecks in the
system. Typically, a performance bottleneck arises when one part of the system—i.e. one
of its subsystems—cannot keep up with the overall throughput requirements of the
system.
 If a performance bottleneck does occur in a system—i.e. if one subsystem is not able to
keep up with other subsystems—then the other subsystems remain idle, waiting for
response from the slower one.
 In a computer system, the key subsystems are processors, memories, I/O interfaces, and
the data paths connecting them.
 Within the processors, we have subsystems such as functional units, registers, cache
memories, and internal data buses.
 Example: Performance bottleneck in a system

 In Fig. 12.1 we see the schematic diagram of a simple computer system consisting of four
processors, a large shared main memory, and a processor-memory bus.
 For the three subsystems, we assume the following performance figures:
i. Each of the four processors can perform double precision floating point
operations at the rate of 500 million per second.
ii. The shared main memory can read/write data at the aggregate rate of 1000
million 32-bit words per second.
iii. The processor-memory bus has the capability of transferring 500 million
32-bit words per second to/from main memory.
 This system exhibits a performance mismatch between the processors, main memory, and
the processor memory bus.
 The system architect must pay careful attention to all such potential mismatches in
system design. Otherwise, the sustained performance which the system can deliver can
only equal the performance of the slowest part of the system—i.e. the bottleneck.
 In this particular case, a better investment for increased system performance could be
a) Faster processor-memory bus, and
b) Improved cache memory with each processor.
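As a rough check of the mismatch (the operand counts assumed here are illustrative): if each double-precision operation consumes even one 64-bit operand from memory, i.e. two 32-bit words, the four processors together demand about 4 x 500 million x 2 = 4000 million words per second, whereas the shared memory can supply 1000 million words per second and the bus only 500 million. The bus is therefore the narrowest point, which is why a faster processor-memory bus and per-processor caches are the suggested remedies.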

BASIC DESIGN ISSUES

 The primary design question: should the emphasis be on
a) Exploiting fully the parallelism present in a single instruction stream, or
b) Supporting multiple instruction streams on the processor in multi-core and/or multi-threading mode?
 In general, designs which aim to maximize the exploitation of instruction level
parallelism need deeper pipelines; up to a point, such designs may support higher clock
rates.
 But, beyond a point, deeper pipelines do not necessarily provide higher net throughput,
while power consumption rises rapidly with clock rate.
 Let us examine the trade-off involved in this context in a simplified way:

total transistors on the chip = (number of processor cores) × (transistors per processor core)

At a given time, VLSI technology limits the left-hand side of the above equation, while the designer must select the two factors on the right. Aggressive exploitation of instruction level parallelism, with multiple functional units and more complex control logic, increases the chip area—and transistor count—per processor core.
 Of course system design would involve issues which are more complex than these, but a
basic design issue is seen here:
o How should the designers divide available chip resources among processors and,
within a single processor, among its various functional elements?
o Within a processor, a set of instructions are in various stages of execution at a given time—within the pipeline stages, functional units, operation buffers, reservation stations, and so on. Recall that functional units themselves may also be internally pipelined.
o To maintain the work flow of instructions within the processor, a superscalar processor makes use of branch prediction.
PROBLEM DEFINITION
 We must address the problem of detecting and exploiting the parallelism which is implicit within the instruction stream.
 We need a prototype instruction for our processor. We assume that the processor has a
load-store type of instruction set, which means that all arithmetic and logical operations
are carried out on operands which are present in programmable registers.
 Operands are transferred between main memory and registers by load and store
instructions only.
 We assume a three-address instruction format, as seen on most RISC processors, so that a
typical instruction for arithmetic or logical operation has the format:
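For example (the mnemonic and the operand order are assumptions, not fixed by the notes), such an instruction could be written as ADD R3, R1, R2, meaning R3 <- R1 + R2, with two source registers and one destination register, all held in the programmable register file.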

 Data transfer instructions have only two operands—source and destination registers; load
and store instructions to/from main memory specify one operand in the form of a memory
address, using an available addressing mode.
 Effective address for load and store is calculated at the time of instruction execution.
 Conditional branch instructions need to be treated as a special category.
Review of dependence basic concepts

 Data Dependences: Assume that instruction Ik follows instruction Ij in the program. Data dependence between Ij and Ik means that both access a common operand. Four access patterns are possible: both instructions read the operand; Ij writes and Ik reads it; Ij reads and Ik writes it; or both write it.
 Of these, the first pattern of register access does not, in fact, create a dependence, since the two instructions can read the common value of the operand in any order.
 The other three patterns of operand access do create dependences amongst instructions. They are known as RAW (read-after-write), WAR (write-after-read), and WAW (write-after-write) dependences, respectively.
 Sometimes we need to show dependences between instructions using graphical notation.
 We shall use small circles to represent instructions, and double line arrows between two
circles to denote dependences.
 The instruction at the head of the arrow is dependent on the instruction at the tail; if
necessary, the type of dependence between instructions may be shown by appropriate
notation next to the arrow.
 A missing arrow between two instructions will mean explicit absence of dependence.
 Single line arrows will be used between instructions when we wish to denote program order without any implied dependence or absence of dependence.
 Figure 12.2 illustrates this notation.

 Often dependences are thus depicted in a basic block of instructions— i.e. a sequence of
instructions with entry only at the first instruction, and exit only at the last instruction of
the sequence.
o Part (a) of Fig. 12.3 shows a basic block of six instructions, denoted I1 through I6
in program order.
o Part (b) of the figure shows a possible pattern of dependences as they may exist
amongst these six instructions. For simplicity, we have not shown the type of each
dependence, e.g. RAW(R3), etc.

Control Dependences
 In typical application programs, basic blocks tend to be small in length, since about 15% to 20% of the instructions in programs are branch and jump instructions, with indirect jumps and returns from procedure calls also included in the latter category.
 Because of typically small sizes of basic blocks in programs, the amount of instruction
level parallelism which can be exploited in a single basic block is limited.
 Assume that instruction Ij is a conditional branch and that, whether another instruction Ik
executes or not depends on the outcome of the conditional branch instruction Ij.
 In such a case, we say that there is a control dependence of instruction Ik on Ij.
Resource Dependences
 This is possibly the simplest kind of dependence to understand, since it refers to a
resource constraint causing dependence amongst instructions needing the resource.
 A resource dependence which results in a pipeline stall can arise for access to any
processor resource— functional unit, data path, register bank, and so on.
 We can certainly say that such resource dependences will arise if hardware resources
provided on the processor do not match the needs of the executing program.
MODEL OF A TYPICAL PROCESSOR

Processor design with reorder buffer


 As we have seen, at one time multiple instructions are in various stages of execution
within the processor. But processor state and program state need to be maintained which
are consistent with the program order of completed instructions.
 This is important from the point of view of preserving the semantics of the program.
 Therefore, even with multiple instructions executing in parallel, the processor must
arrange the results of completed instructions so that their sequence reflects program
order.
 One way to achieve this is by using a reorder buffer, as shown in Fig. 12.4, which allows
instructions to be committed in program order, even if they execute in a different order;

Processor design with reservation stations on functional units


 Functional units in the processor may themselves be internally pipelined; they may also be provided with reservation stations, which accept operations issued by the issue stage of the instruction pipeline.
 A functional unit performs an operation when the required operands for it are available in
the reservation station.
 Figure 12.5 shows a processor design in which functional units are provided with
reservation stations.
 Such designs usually also make use of operand forwarding over a common data bus
(CDB), with tags to identify the source of data on the bus.
 Such a design also implies register renaming, which resolves RAW and WAW
dependences.
 A branch prediction unit has also been shown in Fig. 12.4 and Fig. 12.5 to implement
some form of a branch prediction algorithm.

COMPILER-DETECTED INSTRUCTION LEVEL PARALLELISM

 In the process of translating a sequential source program into machine language, the
compiler performs extensive syntactic and semantic analysis of the source program.
 There are several ways in which the compiler can contribute to the exploitation of
implicit instruction level parallelism.
 One relatively simple technique which the compiler can employ is known as loop
unrolling, by which independent instructions from multiple successive iterations of a loop
can be made to execute in parallel.
 Unrolling means that the body of the loop is repeated n times for n successive values of
the control variable— so that one iteration of the transformed loop performs the work of
n iterations of the original loop.
 Example: Loop unrolling
 Consider the following body of a loop in a user program, where all the variables except
the loop control variable i are assumed to be floating point:

 Now suppose that machine code is generated by the compiler as though the original
program had been written as:

 In the unrolled program fragment, the loop contains four independent instances of the
original loop body—indeed this is the meaning of loop unrolling.
 Suppose machine code corresponding to the second program fragment is executing on a processor. Then clearly, if the processor has sufficient floating point arithmetic resources, instructions from the four loop iterations can be in progress in parallel on the various functional units.
 It is clear that code length of the machine language program increases as a result of loop
unrolling; this increase may have an effect on the cache hit ratio.
 Also, more registers are needed to exploit the instruction level parallelism within the
longer unrolled loop.
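The loop bodies referred to above are not reproduced here; the following C fragment (an illustrative assumption, not the author's original example) shows the transformation for a simple floating-point loop unrolled four times:

    #define N 1024
    static double x[N], y[N], z[N];

    /* Original loop */
    void original(void)
    {
        for (int i = 0; i < N; i++)
            z[i] = x[i] * y[i] + z[i];
    }

    /* The same loop unrolled four times (N is a multiple of 4): the four
       statements in the body are independent of one another, so a processor
       with enough floating point resources can execute them in parallel.     */
    void unrolled(void)
    {
        for (int i = 0; i < N; i += 4) {
            z[i]     = x[i]     * y[i]     + z[i];
            z[i + 1] = x[i + 1] * y[i + 1] + z[i + 1];
            z[i + 2] = x[i + 2] * y[i + 2] + z[i + 2];
            z[i + 3] = x[i + 3] * y[i + 3] + z[i + 3];
        }
    }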

OPERAND FORWARDING
 Operand forwarding, which helps in reducing the impact of true data dependences in the
instruction stream.
 Consider the following simple sequence of two instructions in a running program:
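The listing is not reproduced above; a sketch consistent with the description that follows (the source registers R1 and R2 and the mnemonic SHR are assumptions) is:

    ADD  R3, R1, R2      ; R3 <- R1 + R2
    SHR  R4, R3, #4      ; R4 <- R3 shifted right by four bits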

 The result of the ADD instruction is stored in destination register R3, and then shifted
right by four bits in the second instruction, with the shifted value being placed in R4.
Thus, there is a simple RAW dependence between the two instructions—the output of the
first is required as input operand of the second.
 In terms of our notation, this RAW dependence appears as shown in Fig. 12.6, in the
form of a graph with two nodes and one edge.

 This sequence of data transfers has been illustrated in Fig. 12.7a


 When carried out in this order, clearly the two data transfer operations take two clock cycles.
 But note that the required two transfers of data can be achieved in only one clock cycle if
the ALU output is sent to both R3 and the ALU input in the same clock cycle, as illustrated in
Fig. 12.7b.

 In technical terms, this type of operation within a processor is known as operand
forwarding. Basically this means that, instead of performing two or more data transfers
from a common source one after the other, we perform them in parallel.

REORDER BUFFER
 Since instructions execute in parallel on multiple functional units, the reorder buffer serves
the function of bringing completed instructions back into an order which is consistent with
program order.
 Entries in the reorder buffer are completed instructions, which are queued in program order.
However, since instructions do not necessarily complete in program order, we also need a
flag with each reorder buffer entry to indicate whether the instruction in that position has
completed.
 Figure 12.9 shows a reorder buffer of size eight. Four fields are shown with each entry in the
reorder buffer: instruction identifier, value computed, program-specified destination of the
value computed, and a flag indicating whether the instruction has completed (a sketch of
such an entry appears below).
 In Fig. 12.9, the head of the queue of instructions is shown at the top, arbitrarily labeled as
instr[i].
 This is the instruction which would be committed next, if it has completed execution. When
this instruction commits, its result value is copied to its destination, and the instruction is
then removed from the reorder buffer.
 The next instruction to be issued in the issue stage of the instruction pipeline then joins the
reorder buffer at its tail.
 If the instruction at the head of the queue has not completed, and the reorder buffer is full,
then further issue of instructions is held up, i.e. the pipeline stalls, because there is no free
space in the reorder buffer for one more entry.
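As a minimal sketch (field names and types are assumptions made for illustration, not a description of actual hardware), one entry of such a reorder buffer could be written as follows; the buffer itself is a circular queue of these entries with head and tail pointers:

    #include <stdbool.h>
    #include <stdint.h>

    /* one reorder buffer entry, holding the four fields described above */
    struct rob_entry {
        int      instr_id;    /* instruction identifier                     */
        uint64_t value;       /* value computed by the instruction          */
        int      dest_reg;    /* program-specified destination of the value */
        bool     completed;   /* has the instruction completed execution?   */
    };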

 We now take a brief look at how the use of reorder buffer addresses the various types of
dependences in the program.
(i) Data Dependence: A RAW dependence—i.e. true data dependence—will hold up the
execution of the dependent instruction if the result value required as its input operand is
not available. As suggested above, operand forwarding can be added to this scheme to
speed up the supply of the needed input operand as soon as its value has been computed.
(ii) Control Dependence: Suppose the instruction(s) in the reorder buffer belong to a
branch in the program which should not have been taken, i.e. there has been a mis-
predicted branch.
o Clearly then the reorder buffer should be flushed along with other elements of the
pipeline. Therefore the performance impact of control dependences in the running
program is determined by the accuracy of branch prediction technique employed.

o The reorder buffer plays no direct role in the handling of control dependences.
(iii) Resource Dependences: If an instruction needs a functional unit to execute, but the
unit is not free, then the instruction must wait for the unit to become free; clearly no
technique in the world can change that. In such cases, the processor designer can aim to
achieve at least this: if a subsequent instruction needs to use another functional unit
which is free, then the subsequent instruction can be executed out of order.

REGISTER RENAMING
 Amongst the instructions in various stages of execution within the processor, there would
be occurrences of RAW, WAR and WAW dependences on programmable registers.
 As we have seen, RAW is true data dependence—since a value written by one instruction
is used as an input operand by another. But a WAR or WAW dependence can be avoided
if we have more registers to work with.
 We can simply remove such a dependence by getting the two instructions in question to
use two different registers.
 But we must also assume that the instruction set architecture (ISA) of the processor is
fixed—i.e. we cannot change it to allow access to a larger number of programmable
registers.
 Therefore the only way to make a larger number of registers available to instructions
under execution within the processor is to make the additional registers invisible to
machine language instructions.
 Instructions under execution would use these additional registers, even if instructions
making up the machine language program stored in memory cannot refer to them.
Working of register renaming for WAW dependence
 For example, let us say that the program contains the following two instructions:
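(The original instruction listing is not reproduced in this copy; the reconstruction below assumes three-operand instructions with the destination register written last, and the source registers R1 to R4 are chosen only for illustration.)

    FADD  R1, R2, R5    ; R5 <- (R1) + (R2)
    ...                 ; possibly other intervening instructions
    FSUB  R3, R4, R5    ; R5 <- (R3) - (R4)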

 Both these instructions are writing to register R5, creating thereby a WAW dependence,
i.e. output dependence, on register R5. Clearly, any subsequent instruction should read
the value written into R5 by FSUB, and not the value written by FADD.
 Figure 12.10 shows this dependence in graphical notation.

 With additional registers available for use as these instructions execute, we have a simple
technique to remove this output dependence.

 Let FSUB write its output value to a register other than R5, and let us call that other
register X. Then the instructions which use the value generated by FSUB will refer to X,
while the instructions which use the value generated by FADD will continue to refer to
R5.
 Now since FADD and FSUB are writing to two different registers, the output dependence
or WAW between them has been removed.
 When FSUB commits, then the value in R5 should be updated by the value in X—i.e. the
value computed by FSUB. Then the physical register X, which is not a program visible
register, can be freed up for use in another such situation.
 Note that here we have mapped, or renamed, R5 to X for the purpose of storing the result
of FSUB, and thereby removed the WAW dependence from the instruction stream.

Working of register renaming for WAR dependence
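The instruction listing for this case is not reproduced here; based on the description below, it has the following shape, where the registers other than R2 are assumptions made for illustration:

    FADD  R1, R3, R2    ; first FADD writes its result into R2
    FADD  R2, R4, R6    ; second FADD reads R2 (RAW with the first FADD)
    FSUB  R7, R8, R2    ; FSUB later writes another value into R2
                        ; (WAR with the second FADD)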

 The first FADD instruction is writing a value into R2, which the second FADD
instruction is using, i.e. there is true data dependence between these two instructions.
 Let us assume that, when the first FADD instruction executes, R2 is mapped into
program invisible register Xm. The later FSUB instruction is writing another value into
R2.
 Clearly, the second FADD (and other intervening instructions before FSUB) should see
the value in R2 which is written by the first FADD and not the value written by FSUB.
Figure 12.11 shows these two dependences in graphical notation.

 With register renaming, it is a simple matter to resolve the WAR anti-dependence
between the second FADD and FSUB.
 As mentioned, let Xm be the program invisible register to which R2 has been mapped
when the first FADD executes. This is then the remapped register to which the second
FADD refers for its first data operand.
 Let FSUB write its output to a program invisible register other than Xm, which we denote
by Xn.

 Instructions which use the value written by FSUB refer to Xn, while instructions which
use the value written by the first FADD refer to Xm.
 The WAR dependence between the second FADD and FSUB is removed; but the RAW
dependence between the two FADD instructions is respected via Xm.
 When the first FADD commits, the value in Xm is transferred to R2 and program
invisible register Xm is freed up; likewise, later when FSUB commits, the value in Xn is
transferred to R2 and program invisible register Xn is freed up.

TOMASULO’S ALGORITHM
 The processor of the IBM System/360 Model 91 was designed with multiple floating
point units, and it made use of an innovative algorithm for the efficient use of these units.
 The algorithm was based on operand forwarding over a common data bus, with tags to
identify sources of data values sent over the bus.
 The algorithm has since become known as Tomasulo's algorithm, after the name of its
chief designer; what we now understand as register renaming was also an implicit part of
the original algorithm.
 Recall that, for register renaming, we need a set of program invisible registers to which
programmable registers are re-mapped.
 Tomasulo's algorithm requires these program invisible registers to be provided as part of
the reservation stations of the functional units.
 Figure 12.12 shows such a functional unit connected to the common data bus, with three
reservation stations provided on it.

 The various fields making up a typical reservation station are as follows:
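The field listing itself is not reproduced in this copy; a minimal sketch consistent with the description that follows, with field names and types assumed for illustration, is:

    #include <stdbool.h>
    #include <stdint.h>

    /* one reservation station entry of a functional unit */
    struct reservation_station {
        bool     busy;        /* is this entry in use?                       */
        int      op;          /* operation code issued to this station       */
        uint64_t src1, src2;  /* source operand values, once available       */
        int      t1, t2;      /* source tags, held while an operand value is
                                 still awaited (0 meaning "value present")   */
    };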

 When the needed operand value or values are available in a reservation station, the
functional unit can initiate the required operation in the next clock cycle.

 At the time of instruction issue, the reservation station is filled out with the operation
code (op).
 If an operand value is available, for example in a programmable register, it is transferred
to the corresponding source operand field in the reservation station.
 However, if the operand value is not available at the time of issue, the corresponding
source tag (t1 and/or t2) is copied into the reservation station.
 The source tag identifies the source of the required operand. As soon as the required
operand value is available at its source—which would be typically the output of a
functional unit—the data value is forwarded over the common data bus, along with the
source tag.
 This value is copied into all the reservation station operand slots which have the matching
tag.
 Thus operand forwarding is achieved here with the use of tags. All the destinations which
require a data value receive it in the same clock cycle over the common data bus, by
matching their stored operand tags with the source tag sent out over the bus.
Example: Tomasulo's algorithm and RAW dependence
 Assume that instruction I1 is to write its result into R4, and that two subsequent instructions
I2 and I3 are to read—i.e. make use of—that result value.
 Thus instructions I2 and I3 are truly data dependent (RAW dependence) on instruction I1.
See Fig. 12.13.
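As an illustration only (the mnemonics and the registers other than R4 are assumptions), such a sequence could look like:

    I1:  FADD  R1, R2, R4    ; R4 <- (R1) + (R2)
    I2:  FMUL  R4, R5, R6    ; reads R4, so RAW dependence on I1
    I3:  FSUB  R4, R7, R8    ; reads R4, so RAW dependence on I1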

 Assume that the value in R4 is not available when I2 and I3 are issued; the reason could
be, for example, that one of the operands needed for I1 is itself not available. Thus we
assume that I1 has not even started executing when I2 and I3 are issued.
 When I2 and I3 are issued, they are parked in the reservation stations of the appropriate
functional units.
 Since the required result value from I1 is not available, these reservation station entries of
I2 and I3 get the source tag corresponding to the output of I1, i.e. the output of the
functional unit which is performing the operation of I1.
 When the result of I1 becomes available at its functional unit, it is sent over the common
data bus along with the tag value of its source—i.e. output of functional unit.
 At this point, programmable register R4 as well as the reservation stations assigned to I2
and I3 have the matching source tag—since they are waiting for the result value, which is
being computed by I1.
 When the tag sent over the common data bus matches the tag in any destination, the data
value on the bus is copied from the bus into the destination.

 The copy occurs at the same time into all the destinations which require that data value.
Thus R4 as well as the two reservation stations holding I2 and I3 receive the required
data value, which has been computed by I1, at the same time over the common data bus.
 Thus, through the use of source tags and the common data bus, in one clock cycle, three
destination registers receive the value produced by I1: programmable register R4, and
the operand registers in the reservation stations assigned to I2 and I3.
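The tag-matching copy described in this example can be sketched in C as below. This is only an illustration of the idea, not a description of actual hardware; the array sizes, names and the convention that tag 0 means "no value awaited" are all assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_RS   8        /* reservation station entries (assumed) */
    #define NUM_REGS 16       /* programmable registers (assumed)      */

    struct rs_entry { bool busy; int t1, t2; uint64_t src1, src2; };

    static struct rs_entry rs[NUM_RS];          /* reservation stations         */
    static uint64_t        reg_file[NUM_REGS];  /* programmable register values */
    static int             reg_tag[NUM_REGS];   /* pending source tag per
                                                   register, 0 = none awaited   */

    /* broadcast (src_tag, value) over the common data bus: every waiting
       destination whose stored tag matches copies the value in the same cycle */
    void cdb_broadcast(int src_tag, uint64_t value) {
        for (int i = 0; i < NUM_RS; i++) {
            if (rs[i].busy && rs[i].t1 == src_tag) { rs[i].src1 = value; rs[i].t1 = 0; }
            if (rs[i].busy && rs[i].t2 == src_tag) { rs[i].src2 = value; rs[i].t2 = 0; }
        }
        for (int r = 0; r < NUM_REGS; r++) {
            if (reg_tag[r] == src_tag) { reg_file[r] = value; reg_tag[r] = 0; }
        }
    }

In the example above, R4 and the two waiting reservation station slots of I2 and I3 would all carry the tag of I1's functional unit, so a single broadcast updates all three destinations at once.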

BRANCH PREDICTION
 A basic branch prediction technique uses a so-called two-bit predictor. A two-bit counter
is maintained for every conditional branch instruction in the program.
 The two-bit counter has four possible states; these four states and the possible transitions
between these states are shown in Fig. 12.15.
 When the counter state is 0 or 1, the respective branch is predicted as taken; when the
counter state is 2 or 3, the branch is predicted as not taken.
 When the conditional branch instruction is executed and the actual branch outcome is
known, the state of the respective two-bit counter is changed as shown in the figure using
solid and broken line arrows.
 When two successive predictions come out wrong, the prediction is changed from branch
taken to branch not taken, and vice versa.
 In Fig. 12.15, state transitions made on mis-predictions are shown using broken line
arrows, while solid line arrows show state transitions made on predictions which come
out right.
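One common software sketch of such a two-bit predictor, consistent with the description above (the state numbering and function names are assumptions; states 0 and 1 predict taken, states 2 and 3 predict not taken), is:

    #include <stdbool.h>

    /* prediction: counter states 0 and 1 mean "predict taken",
       states 2 and 3 mean "predict not taken" */
    bool predict_taken(int state) {
        return state <= 1;
    }

    /* update the counter once the actual branch outcome is known;
       two successive mis-predictions flip the prediction */
    int update_state(int state, bool taken) {
        if (taken)
            return (state > 0) ? state - 1 : 0;   /* move toward state 0 */
        else
            return (state < 3) ? state + 1 : 3;   /* move toward state 3 */
    }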

LIMITATIONS IN EXPLOITING INSTRUCTION LEVEL PARALLELISM


Consider as an example a multiple issue processor which targets four instruction issues per clock
cycle, and has eight stages in the instruction pipeline.
1. Clearly, in this processor at one time as many as 32 instructions may be in different
stages of fetch, decode, issue, execute, write result, commit, and so on—and each stage in
the processor must handle four instructions in every clock cycle.
2. Dynamic scheduling would require a fairly large instruction window, to maintain the
issue rate at the targeted four instructions per clock cycle.
3. Instructions in the window must be checked for dependences, to support out of order
issue. This requires associative memory and its control logic, which means an overhead
in chip area and power consumption; such overhead would increase with window size.

4. In turn, such increased overhead in the aggressive pursuit of instruction level parallelism
would adversely impact the processor clock speed which is achievable for a given VLSI
technology.
5. Also, with greater issue multiplicity k, there would be a higher probability of fewer than k
instructions being issued in some clock cycles. The reason for this can be simply that a
functional unit is not available, or that true RAW dependences amongst instructions hold
up instruction issue.
6. The increased overhead also necessitates a larger number of stages in the instruction
pipeline, so as to limit the total delay per stage and thereby achieve faster clock cycle; but
a longer pipeline results in higher cost of flushing the pipeline.
7. The number of devices on the chip is largely determined by the fabrication technology
being used, and power consumption must be held within the limits of the heat dissipation
possible.

THREAD LEVEL PARALLELISM


 Let us consider once again the processor with instruction pipeline of depth eight, and
with targeted superscalar performance of four instructions completed in every clock
cycle.
 Now suppose that these instructions come from four independent threads of execution.
Then, on average, the number of instructions in the processor at any one time from one
thread would be 4 × 8 / 4 = 8.
 With the threads being independent of one another, there is a smaller total number of data
dependences amongst the instructions in the processor.
 Further, with control dependences also being separated into four threads, less aggressive
branch prediction is needed.
 Another major benefit of such hardware-supported multi-threading is that pipeline stalls
are very effectively utilized.
 If one thread runs into a pipeline stall—for access to main memory, say—then another
thread makes use of the corresponding processor clock cycles, which would otherwise be
wasted.
 Thus hardware support for multi-threading becomes an important latency hiding
technique.
 To provide support for multi-threading, the processor must be designed to switch
between threads—either on the occurrence of a pipeline stall, or in a round robin manner.
 Depending on the specific strategy adopted for switching between threads, hardware
support for multithreading may be classified as one of the following:
i. Coarse-grain multi-threading: refers to switching between threads only
on the occurrence of a major pipeline stall—which may be caused by, say,
access to main memory, with latencies of the order of a hundred processor
clock cycles.
ii. Fine-grain multi-threading: refers to switching between threads on the
occurrence of any pipeline stall, which may be caused by, say, an L1 cache
miss. But this term would also apply to designs in which processor clock
cycles are regularly being shared amongst executing threads, even in the
absence of a pipeline stall.

iii. Simultaneous multi-threading: refers to machine instructions from two
(or more) threads being issued in parallel in each processor clock cycle.
This would correspond to a multiple-issue processor where the multiple
instructions issued in a clock cycle come from an equal number of
independent execution threads.
