COrrrrr Unit IV
The speedup or efficiency achieved by using pipelining depends on the number of pipe
stages and the number of available tasks that can be subdivided.
[Jan 2013]
Prepared By
Mrs. V.Rekha AP / MCA
Data hazards.
Instruction hazards.
Structural hazards.
It is used to overcome the drawback that assigning individual bits to each control signal
results in long microinstructions.
All operations and data transfers within the processor take place within time periods
defined by the processor clock.
27. Define multiphase clocking.
When edge-triggered flip-flops are not used, two or more clock signals may be needed to
guarantee proper transfer of data. This is known as multiphase clocking.
28. What are the three steps required for a memory read operation?
R1out, MARin, Read
MDRinE, WMFC
MDRout, R2in
29. What are the actions required to execute a complete instruction?
Fetch the instruction
Fetch the first operand (the contents of the memory location pointed to by R3).
Perform the addition
Load the result into R1
30. Define register file.
In a three-bus structure used to connect the registers and the ALU of a processor, all
general-purpose registers are combined into a single block called the register file.
31. Define control store.
The micro routines for all instructions in the instruction set of a computer are stored
in a special memory called the control store.
32. Define vertical organization.
Highly encoded schemes that use compact codes to specify only a small number of
control functions in each microinstruction are referred to as a vertical organization.
33. Define horizontal organization.
The minimally encoded scheme, in which many resources can be controlled with a single
microinstruction, is called a horizontal organization.
34. Define hazard.
Pipelined operation is said to have been stalled for two clock cycles. Normal
pipelined operation resumes in cycle 7. Any condition that causes the pipeline to stall is
called a hazard.
35. Define control hazards.
The Pipeline may also be stalled because of a delay in the availability of an
instruction. For example, this may be a result of a miss in the cache, requiring the
instruction to be fetched from the main memory. Such hazards are often called control
hazards.
36. Define stalls.
When the Decode unit is idle in cycles 3 through 5, the Execute unit is idle in cycles 4
through 6, and the Write unit is idle in cycles 5 through 7, such idle periods are called stalls.
37. What is meant by dispatch unit?
The instruction queue can store several instructions. A separate unit, which we call
the dispatch unit, takes instructions from the front of the queue and sends them to the
execution unit.
38. What is branch folding?
When the instruction fetch unit executes the branch instruction concurrently with the
execution of other instructions, the technique is referred to as branch folding.
39. Define branch delay slot.
When execution of I2 is completed and a branch is to be made, the processor must
discard I3 and fetch the instruction at the branch target. The location following a branch
instruction is called a branch delay slot.
40. What is delayed branching?
A technique called delayed branching can minimize the penalty incurred as a result of
conditional branch instructions. The idea is simple. The instructions in the delay slots are
always fetched.
41. Define static branch prediction.
When the branch prediction decision is always the same every time a given instruction
is executed, the approach is called static branch prediction.
42. Define dynamic branch prediction.
An approach in which the prediction decision may change depending on execution
history is called dynamic branch prediction.
43. Define multiple-issue.
A more aggressive approach is to equip the processor with multiple processing units
to handle several instructions in parallel in each pipeline stage. With this arrangement,
several instructions start execution in the same clock cycle, and the processor is said to
use multiple-issue.
44. Define commitment unit.
When out-of-order execution is allowed, a special control unit is needed to guarantee
in-order commitment. This is called the commitment unit.
45. Explain deadlock.
A deadlock is a situation that can arise when two units, A and B, use a shared
resource. Suppose that unit B cannot complete its task until unit A completes its task. At
the same time, unit B has been assigned a resource that unit A needs. If this happens,
neither unit can complete its task. Unit A is waiting for the resource it needs, which is
being held by unit B. At the same time, unit B is waiting for unit A to finish before it can
release that resource.
46. Define Superscalar operation.
Superscalar describes a microprocessor design that makes it possible for more than
one instruction at a time to be executed during a single clock cycle. In a superscalar design,
the processor or the instruction compiler is able to determine whether an instruction can be
carried out independently of other sequential instructions, or whether it has a dependency
on another instruction and must be executed in sequence with it.
47. List out the disadvantages of superscalar operations.
The degree of intrinsic parallelism in the instruction stream, i.e. limited amount of
instruction-level parallelism.
The complexity and time cost of the dispatcher and associated dependency checking
logic.
[Dec 2011]
The CPU rearranges the instructions to reduce stalls while preserving dependences; this
technique is called dynamic scheduling. It uses a hardware-based mechanism to
rearrange the instruction execution order at run time, which reduces stalls and enables
handling cases where dependences are unknown at compile time.
56. What are the advantages of super scalar processor?
[May 2012]
Hardware detects potential parallelism between instructions.
Hardware tries to issue as many instructions as possible in parallel.
Hardware performs register renaming.
If functional units are added in a new version of the architecture, or some other
improvements have been made to the architecture, old programs can benefit from
the additional potential for parallelism, because the new hardware will issue the old
instruction sequence in a more efficient way.
57. Compare and contrast hardwired and microprogramming control. [Jan 2013]
Hardwired control:
Hardwired control is a control mechanism to generate control signals by using
appropriate finite state machine (FSM).
Hardwired systems are made to perform in a set manner, implemented with logic,
switches, etc. between any input and output in the system, so the manner in
which the control is executed is fixed by the wiring.
Hardwired control also can be used for implementing sophisticated CISC machines.
Microprogrammed control:
Microprogrammed control is a control mechanism to generate control signals by
using a memory called control storage (CS), which contains the control signals.
Although microprogrammed control seems to be advantageous to CISC machines,
since CISC requires systematic development of sophisticated control signals, there is
no intrinsic difference between these two control mechanisms.
Microprogrammed control is not always necessary to implement CISC machines.
Part -B
1. Write about general CPU organization with an example. (or) Explain the fundamental
concepts of the processor.
[Jan 2012]
The processor fetches one instruction at a time and performs the operations specified.
Instructions are fetched from successive memory locations until a branch or a jump
instruction is encountered. The processor keeps track of the address of the memory location
containing the next instruction to be fetched using the program counter, PC. After fetching
an instruction, the contents of the PC are updated to point to the next instruction in the
sequence. A branch instruction may load a different value into the PC.
Another key register in the processor is the instruction register, IR. Suppose that each
instruction comprises 4 bytes, and that it is stored in one memory word. To execute an
instruction, the processor has to perform the following three steps:
Fetch the contents of the memory location pointed to by the PC. The contents of this
location are interpreted as an instruction to be executed. Hence, they are loaded into
the IR.
IR ← [[PC]]
Assuming that the memory is byte addressable, increment the contents of the PC by
4, that is,
PC ← [PC] + 4
Carry out the actions specified by the instruction in the IR.
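The first two steps can be sketched in a few lines of Python. This is only an illustrative model: the `memory` dictionary and its instruction strings are hypothetical, and the 4-byte word size is the assumption stated in the text.

```python
def fetch(memory, pc, word_size=4):
    """Model the fetch phase: IR <- [[PC]], then PC <- [PC] + 4."""
    ir = memory[pc]            # step 1: the word addressed by PC is loaded into IR
    new_pc = pc + word_size    # step 2: PC now points to the next instruction
    return ir, new_pc

# Hypothetical byte-addressable memory holding one 4-byte instruction per word.
memory = {0: "Add (R3),R1", 4: "Branch X", 8: "Move R1,R4"}

pc = 0
ir, pc = fetch(memory, pc)
print(ir, pc)
```

Step 3 (execution) is left out on purpose, since it depends on the instruction held in the IR.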
Where an instruction occupies more than one word, steps 1 and 2 must be repeated as
many times as necessary to fetch the complete instruction. These two steps are usually
referred to as the fetch phase; step 3 constitutes the execution phase. Consider a processor
organization in which the arithmetic and logic unit (ALU) and all the registers are
interconnected via a single common bus. This bus is internal to the processor and should not
be confused with the external bus that connects the processor to the memory and I/O devices.
The data and address lines of the external memory bus are connected to the internal
processor bus via the memory data register, MDR, and the memory address register, MAR,
respectively. Register MDR has two inputs and two outputs. Data may be loaded into MDR
either from the memory bus or from the internal processor bus. The data stored in MDR
may be placed on either bus.
The input of MAR is connected to the internal bus, and its output is connected to the
external bus. The control lines of the memory bus are connected to the instruction decoder
and control logic block. This unit is responsible for issuing the signals that control the
operation of all the units inside the processor and for interacting with the memory bus.
The number and use of the processor registers R0 through R(n - 1) vary considerably from
one processor to another. Registers may be provided for general-purpose use by the
programmer. Some may be dedicated as special-purpose registers, such as index registers
or stack pointers.
Three registers, Y, Z, and TEMP, have not been mentioned before. These registers are
transparent to the programmer, that is, the programmer need not be concerned with them
because they are never referenced explicitly by any instruction.
The multiplexer MUX selects either the output of register Y or a constant value 4 to be
provided as input A of the ALU. The constant 4 is used to increment the contents of the
program counter. We refer to the two possible values of the MUX control input Select as
Select4 and SelectY, for selecting the constant 4 or register Y, respectively.
As instruction execution progresses, data are transferred from one register to another, often
passing through the ALU to perform some arithmetic or logic operation. The instruction
decoder and control logic unit is responsible for implementing the actions specified by the
instruction loaded in the IR register.
The decoder generates the control signals needed to select the registers involved and direct
the transfer of data. The registers, the ALU, and the interconnecting bus are collectively
referred to as the datapath.
For example, to transfer the contents of register R1 to register R4:
Enable the output of register R1 by setting R1out to 1. This places the contents of R1
on the processor bus.
Enable the input of register R4 by setting R4in to 1. This loads data from the
processor bus into register R4.
All operations and data transfers within the processor take place within time periods defined
by the processor clock. The control signals that govern a particular transfer are asserted at
the start of the clock cycle.
Performing Arithmetic And Logical Operation:
The ALU is a combinational circuit that has no internal storage. It performs
arithmetic and logic operations on the two operands applied to its A and B inputs. One of
the operands is the output of the multiplexer MUX and the other operand is obtained directly
from the bus. The result produced by the ALU is stored temporarily in register Z. Therefore,
a sequence of operations to add the contents of register R1 to those of register R2 and store
the result in register R3 is
R1out, Yin
R2out, SelectY, Add, Zin
Zout, R3in
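A minimal Python sketch of these three control steps follows. It assumes the registers can be modelled as a dictionary of integers and that only one value is on the bus per step; the function name is hypothetical.

```python
def add_r1_r2_into_r3(regs):
    """Apply the three control steps on a copy of the register values."""
    regs = dict(regs)
    # Step 1: R1out, Yin -> the bus carries R1 and register Y latches it.
    bus = regs["R1"]
    regs["Y"] = bus
    # Step 2: R2out, SelectY, Add, Zin -> the ALU adds Y (input A) to the bus (input B).
    bus = regs["R2"]
    regs["Z"] = regs["Y"] + bus
    # Step 3: Zout, R3in -> the bus carries Z and register R3 latches it.
    bus = regs["Z"]
    regs["R3"] = bus
    return regs

result = add_r1_r2_into_r3({"R1": 5, "R2": 7, "R3": 0, "Y": 0, "Z": 0})
print(result["R3"])
```

Three steps are needed because the single bus can carry only one operand at a time, which is why Y and Z are used as temporary registers.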
Fetching a Word from Memory:
The connection for register MDR has four control signals: MDRin and MDRout control
the connection to the internal bus, and MDRinE and MDRoutE control the connection to the
external bus. The circuit is easily modified to provide the additional connections.
[Figure: Input and output gating for one register bit.]
Example:
MAR ← [R1]
Start a Read operation on the memory bus
Wait for the MFC response from the memory
Load MDR from the memory bus
R2 ← [MDR]
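The same read operation can be traced in Python. This is a sketch under simplifying assumptions: memory is a hypothetical dictionary that responds immediately, so the wait for MFC is represented only by a comment, and the signal names in the comments are those of the three-step sequence R1out, MARin, Read / MDRinE, WMFC / MDRout, R2in.

```python
def memory_read(regs, memory):
    """Model Move (R1), R2 as three register-transfer steps."""
    regs = dict(regs)
    # R1out, MARin, Read: the address in R1 goes to MAR and a read is started.
    regs["MAR"] = regs["R1"]
    # MDRinE, WMFC: once the memory signals MFC, MDR is loaded from the memory bus.
    regs["MDR"] = memory[regs["MAR"]]
    # MDRout, R2in: the fetched word moves over the internal bus into R2.
    regs["R2"] = regs["MDR"]
    return regs

memory = {1000: 42}  # hypothetical memory contents
regs = memory_read({"R1": 1000, "R2": 0, "MAR": 0, "MDR": 0}, memory)
print(regs["R2"])
```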
The updated value is moved from register Z back into the PC during step 2, while waiting for
the memory to respond. In step 3, the word fetched from the memory is loaded into the IR.
Steps 1 through 3 constitute the instruction fetch phase, which is the same for all
instructions. The instruction decoding circuit interprets the contents of the IR at the
beginning of step 4. This enables the control circuitry to activate the control signals for
steps 4 through 7, which constitute the execution phase. The contents of register R3 are
transferred to the MAR in step 4, and a memory read operation is initiated.
Then the contents of R1 are transferred to register Y in step 5, to prepare for the
addition operation. When the Read operation is completed, the memory operand is available
in register MDR, and the addition operation is performed in step 6. The contents of MDR are
gated to the bus, and thus also to the B input of the ALU, and register Y is selected as the
second input to the ALU by choosing SelectY. The sum is stored in register Z, then
transferred to R1 in step 7. The End signal causes a new instruction fetch cycle to begin by
returning to step 1.
Branch Instruction:
A branch instruction replaces the contents of the PC with the branch target address.
This address is usually obtained by adding an offset X, which is given in the branch
instruction, to the updated value of the PC. Consider a control sequence that implements an
unconditional branch instruction. Processing starts, as usual, with the fetch phase. This
phase ends when the instruction is loaded into the IR in step 3.
The offset value is extracted from the IR by the instruction decoding circuit, which
will also perform sign extension if required. Since the value of the updated PC is already
available in register Y, the offset X is gated onto the bus in step 4, and an addition operation
is performed. The result, which is the branch target address, is loaded into the PC in step 5.
The offset X used in a branch instruction is usually the difference between the branch target
address and the address immediately following the branch instruction. For example, if the
branch instruction is at location 2000 and if the branch target address is 2050, the value of
X must be 46.
The reason for this can be readily appreciated from the control sequence. The PC is
incremented during the fetch phase, before knowing the type of instruction being executed.
Thus, when the branch address is computed in step 4, the PC value used is the updated
value, which points to the instruction following the branch instruction in the memory.
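The offset arithmetic can be checked with a short Python sketch. The function names are hypothetical, and the 4-byte instruction size is the assumption used throughout these notes.

```python
def branch_offset(branch_addr, target_addr, instr_size=4):
    """Offset X stored in the branch instruction."""
    updated_pc = branch_addr + instr_size   # PC was already incremented during fetch
    return target_addr - updated_pc

def branch_target(branch_addr, offset, instr_size=4):
    """Target address the processor computes: updated PC plus offset X."""
    return branch_addr + instr_size + offset

print(branch_offset(2000, 2050))
```

For the example in the text, the updated PC is 2004, so the offset is 2050 - 2004 = 46.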
In step 1, the contents of the PC are passed through the ALU, using the R=B control
signal, and loaded into the MAR to start a memory read operation. At the same time
the PC is incremented by 4. Note that the value loaded into MAR is the original
contents of the PC. The incremented value is loaded into the PC at the end of the
clock cycle and will not affect the contents of MAR.
In step 2, the processor waits for MFC and loads the data received into MDR, then
transfers them to IR in step 3.
Finally, the execution phase of the instruction requires only one control step to
complete, step 4. By providing more paths for data transfer a significant reduction in
the number of clock cycles needed to execute an instruction is achieved.
4. Explain Hardwired control with the block diagram, Micro Programmed control &
Micro instruction
[May 2012, Dec 2011 & Jan 2013]
The processor must have some means of generating the control signals needed in the
proper sequence. Computer designers use a wide variety of techniques to solve this
problem. The approaches used fall into one of two categories:
Hardwired control
Micro programmed control.
The required control signals are determined by the following information:
Contents of the control step counter
Contents of the instruction register
Contents of the condition code flags
External input signals, such as MFC and interrupt requests
The decoder/encoder block is a combinational circuit that generates the required control
outputs, depending on the state of all its inputs. When the decoding and encoding functions
are separated, for any instruction loaded in the IR, one of the decoder output lines INS1
through INSm is set to 1, and all other lines are set to 0.
The input signals to the encoder block are combined to generate the individual control
signals Yin, PCout, Add, End, and so on. As an example, consider how the encoder generates
the Zin control signal for the processor organization. This circuit implements the logic function
Zin = T1 + T6 · ADD + T4 · BR + ...
That is, the signal is asserted during time slot T1 for all instructions, during T6 for an Add
instruction, during T4 for an unconditional branch instruction, and so on. A similar circuit
generates the End control signal from its own logic function.
The End signal starts a new instruction fetch cycle by resetting the control step
counter to its starting value. The counter is also governed by a control signal called RUN.
When set to 1, RUN causes the counter to be incremented by one at the end of every clock
cycle. When RUN is equal to 0, the counter stops counting.
The control hardware can be viewed as a state machine that changes from one state
to another in every clock cycle, depending on the contents of the instruction register, the
condition codes, and the external inputs. The outputs of the state machine are the control
signals. The sequence of operations carried out by this machine is determined by the wiring
of the logic elements, hence the name "hardwired." A controller that uses this approach can
operate at high speed. However, it has little flexibility, and the complexity of the instruction
set it can implement is limited.
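The encoder logic for Zin described above (asserted in T1 for all instructions, in T6 for Add, and in T4 for an unconditional branch) can be written as a Boolean function. This Python sketch covers only those three terms; the instruction mnemonics "ADD" and "BR" and the function name are illustrative, and a real encoder would contain further terms for other instructions.

```python
def z_in(step, instruction):
    """Zin = T1 + T6.ADD + T4.BR (remaining terms omitted)."""
    return (
        step == 1                                  # T1: every instruction (PC increment)
        or (step == 6 and instruction == "ADD")    # T6 . ADD
        or (step == 4 and instruction == "BR")     # T4 . BR
    )

for step in range(1, 8):
    print(step, z_in(step, "ADD"))
```

In hardware each term is an AND gate fed by a time-slot line and a decoder line, and the terms are combined by an OR gate.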
A Complete Processor:
This structure has an instruction unit that fetches instructions from an instruction
cache or from the main memory when the desired instructions are not already in the cache.
It has separate processing units to deal with integer data and floating-point data. A data
cache is inserted between these units and the main memory. Using separate caches for
instructions and data is common practice in many processors today.
The microroutines for all instructions in the instruction set of a computer are stored in a
special memory called the control store. The control unit can generate the control signals for
any instruction by sequentially reading the control words (CWs) of the corresponding
microroutine from the control store. This suggests organizing the control unit around the
control store.
To read the control words sequentially from the control store, a microprogram counter (µPC)
is used. Every time a new instruction is loaded into the IR, the output of the block labeled
"starting address generator" is loaded into the µPC.
In microprogrammed control, an alternative approach is to use conditional branch
microinstructions. In addition to the branch address, these microinstructions specify which
of the external inputs, condition codes, or, possibly, bits of the instruction register should be
checked as a condition for branching to take place.
The instruction Branch<0 may now be implemented by a microroutine. After loading this
instruction into IR, a branch microinstruction transfers control to the corresponding
microroutine, which is assumed to start at location 25 in the control store. This address is
the output of the starting address generator block. The microinstruction at location 25 tests
the N bit of the condition codes. If this bit is equal to 0, a branch takes place to location 0
to fetch a new machine instruction. Otherwise, the microinstruction at location 26 is executed
to put the branch target address into register Z. The microinstruction in location 27 loads
this address into the PC.
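The flow of this microroutine can be traced with a small Python sketch. The control-store addresses 25, 26 and 27 come from the text; the function name, the integer model of the N flag, and passing the already-updated PC and the offset as arguments are assumptions made for illustration.

```python
def run_branch_lt0(n_flag, pc, offset):
    """Trace the Branch<0 microroutine; return (locations visited, final PC)."""
    upc = 25            # supplied by the starting address generator
    visited = []
    z = None
    while upc != 0:     # location 0 starts the fetch of the next instruction
        visited.append(upc)
        if upc == 25:
            # Conditional branch microinstruction: test the N bit.
            upc = 26 if n_flag else 0
        elif upc == 26:
            z = pc + offset     # branch target address is placed in Z
            upc = 27
        else:
            pc = z              # location 27: Z is loaded into the PC
            upc = 0
    return visited, pc

print(run_branch_lt0(1, 2004, 46))
print(run_branch_lt0(0, 2004, 46))
```

When N is 0 only location 25 is visited and the PC keeps its updated value, so execution continues with the next sequential instruction.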
Microinstructions:
Horizontal and vertical organizations represent the two organizational extremes in
microprogrammed control. Many intermediate schemes are also possible, in which the
degree of encoding is a design parameter. The layout is a horizontal organization because it
groups only mutually exclusive microoperations in the same fields. As a result, it does not
limit in any way the processor's ability to perform various microoperations in parallel.
Highly encoded schemes that use compact codes to specify only a small number of control
functions in each microinstruction are referred to as a vertical organization. On the other
hand, the minimally encoded scheme, in which many resources can be controlled with a
single microinstruction, is called a horizontal organization.
The horizontal approach is useful when a higher operating speed is desired and when the
machine structure allows parallel use of resources. The vertical approach results in
considerably slower operating speeds because more microinstructions are needed to
perform the desired control functions.
5. Explain in detail the implementation of pipeline with a neat diagram. [Jan 2012]
In computer architecture, pipelining means executing machine instructions concurrently.
Pipelining is used in modern computers to achieve high performance. The speed of
execution of programs is influenced by many factors. One way to improve performance is to
use faster circuit technology to build the processor and the main memory. Another
possibility is to arrange the hardware so that more than one operation can be performed at
the same time. In this way, the number of operations performed per second is increased
even though the elapsed time needed to perform any one operation is not changed.
Pipelining is a particularly effective way of organizing concurrent activity in a computer
system. The basic idea is very simple. It is frequently encountered in manufacturing plants,
where pipelining is commonly known as an assembly-line operation. The processor executes
a program by fetching and executing instructions, one after the other. Let Fi and Ei refer to
the fetch and execute steps for instruction Ii. Execution of a program consists of a
sequence of fetch and execute steps: F1, E1, F2, E2, F3, E3, and so on.
Now consider a computer that has two separate hardware units, one for fetching
instructions and another for executing them. The instruction fetched by the fetch unit is
deposited in an intermediate storage buffer, B1. This buffer is needed to enable the
execution unit to execute the instruction while the fetch unit is fetching the next instruction.
The results of execution are deposited in the destination location specified by the
instruction. The data operated on by the instructions are assumed to be inside the block
labeled "Execution unit".
The computer is controlled by a clock whose period is such that the fetch and execute steps
of any instruction can each be completed in one clock cycle. Operation of the computer
proceeds as follows. In the first clock cycle, the fetch unit fetches an instruction I1 (step F1)
and stores it in buffer B1 at the end of the clock cycle. In the second clock cycle, the
instruction fetch unit proceeds with the fetch operation for instruction I2 (step F2).
Meanwhile, the execution unit performs the operation specified by instruction I1, which is
available to it in buffer B1 (step E1). By the end of the second clock cycle, the execution of
instruction I1 is completed and instruction I2 is available. Instruction I2 is stored in B1,
replacing I1, which is no longer needed. Step E2 is performed by the execution unit during
the third clock cycle, while instruction I3 is being fetched by the fetch unit. In this manner,
both the fetch and execute units are kept busy all the time.
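The overlap described above can be tabulated with a short Python sketch. It assumes an ideal two-stage pipeline with no stalls, in which every fetch and every execute step takes exactly one clock cycle; the function name is hypothetical.

```python
def two_stage_schedule(n):
    """Return {cycle: (fetch_step, execute_step)} for n instructions."""
    schedule = {}
    for cycle in range(1, n + 2):            # n instructions finish in n + 1 cycles
        fetch = f"F{cycle}" if cycle <= n else None
        execute = f"E{cycle - 1}" if cycle >= 2 else None
        schedule[cycle] = (fetch, execute)
    return schedule

for cycle, (fetch, execute) in two_stage_schedule(3).items():
    print(cycle, fetch, execute)
```

Under these assumptions n instructions need n + 1 cycles instead of the 2n cycles required when fetch and execute steps are strictly sequential.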
For example, the two instructions Mul R2,R3,R4 and Add R5,R4,R6 give rise to a data
dependency. The result of the multiply instruction is placed into register R4, which in turn is
one of the two source operands of the Add instruction. Assuming that the multiply operation
takes one clock cycle to complete, execution proceeds as follows. As the Decode unit decodes
the Add instruction in cycle 3, it realizes that R4 is used as a source operand.
Hence, the D step of that instruction cannot be completed until the W step of the multiply
instruction has been completed. Completion of step D2 must be delayed to clock cycle 5,
and is shown as step D2A. Instruction I3 is fetched in cycle 3, but its decoding must be
delayed because step D3 cannot precede D2. Hence, pipelined execution is stalled for two
cycles.
Operand forwarding:
The data hazard just described arises because one instruction, instruction I2, is waiting for
data to be written in the register file. However, these data are available at the output of the
ALU once the Execute stage completes step E1. Hence, the delay can be reduced, or possibly
eliminated, if we arrange for the result of instruction I1 to be forwarded directly for use in
step E2.
Consider the part of the processor datapath involving the ALU and the register file. This
arrangement is similar to the three-bus structure, except that registers SRC1, SRC2, and RSLT
have been added. These registers constitute interstage buffers needed for pipelined operation.
Registers SRC1 and SRC2 are part of buffer B2 and RSLT is part of B3. The data forwarding
mechanism is provided by additional forwarding connection lines. The two multiplexers
connected at the inputs to the ALU allow the data on the destination bus to be selected
instead of the contents of either the SRC1 or SRC2 register. When the instructions are
executed in this datapath, the operations performed in each clock cycle are as follows. After
decoding instruction I2 and detecting the data dependency, a decision is made to use data
forwarding. The operand not involved in the dependency, register R5, is read and loaded in
register SRC1 in clock cycle 3. In the next clock cycle, the product produced by instruction I1
is available in register RSLT, and because of the forwarding connection, it can be used in
step E2. Hence, execution of I2 proceeds without interruption.
Handling data hazards in software:
I1: Mul R2,R3,R4
NOP
NOP
I2 : Add R5,R4,R6
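A compiler pass that inserts these NOPs could look roughly like the following Python sketch. It assumes the four-stage pipeline without forwarding used in these notes, where a dependent instruction must be separated from its producer by two slots; the tuple format (op, src1, src2, dest) and the names `insert_nops` and `NOP` are hypothetical.

```python
NOP = ("NOP", None, None, None)

def insert_nops(program, gap=2):
    """Insert NOPs so that no instruction reads a register written
    by one of the `gap` instructions immediately before it."""
    out = []
    for instr in program:
        _, src1, src2, _ = instr
        needed = 0
        # Look back over the last `gap` emitted slots, nearest first.
        for back, prev in enumerate(reversed(out[-gap:]), start=1):
            dest = prev[3]
            if dest is not None and dest in (src1, src2):
                needed = max(needed, gap - back + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out

program = [("Mul", "R2", "R3", "R4"), ("Add", "R5", "R4", "R6")]
for instr in insert_nops(program):
    print(instr)
```

A compiler would normally try to fill these slots with useful independent instructions first and fall back to NOPs only when none are available.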
Side effect:
The data dependencies encountered in the preceding examples are explicit and easily
detected because the register involved is named as the destination in instruction I1 and as a
source in I2. Sometimes an instruction changes the contents of a register other than the
one named as the destination.
Classification of data dependent hazards:
Data dependent hazards can be classified into three types according to various data
update patterns. Consider two instructions I1 and I2, with I1 occurring before I2 in program
order. Here R(i) denotes the set of locations written by instruction Ii (its range) and D(i)
the set of locations it reads (its domain).
I. Read After Write (RAW) (flow dependence hazard) ( R(1) ∩ D(2) ≠ ∅ )
A RAW data hazard refers to a situation where I2 refers to a result of I1 that has not yet
been calculated or retrieved.
II. Write After Read (WAR) (anti dependence hazard) ( D(1) ∩ R(2) ≠ ∅ )
A write after read (WAR) data hazard arises in concurrent execution when I2 tries to write a
location before I1 has read it.
III. Write After Write (WAW) (output dependence hazard) ( R(1) ∩ R(2) ≠ ∅ )
A write after write (WAW) data hazard may occur in a concurrent execution environment when
I2 tries to write a location before I1 has written it.
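The three cases can be checked mechanically by comparing the registers each instruction reads and writes. The Python sketch below assumes each instruction is described by a hypothetical dictionary with "reads" and "writes" sets; it reports potential hazards only, since whether a stall actually occurs depends on the pipeline.

```python
def classify_hazards(i1, i2):
    """Return the potential data hazards between i1 and a later instruction i2."""
    hazards = []
    if i1["writes"] & i2["reads"]:
        hazards.append("RAW")   # i2 reads what i1 writes
    if i1["reads"] & i2["writes"]:
        hazards.append("WAR")   # i2 writes what i1 reads
    if i1["writes"] & i2["writes"]:
        hazards.append("WAW")   # both write the same location
    return hazards

mul = {"reads": {"R2", "R3"}, "writes": {"R4"}}   # Mul R2,R3,R4
add = {"reads": {"R5", "R4"}, "writes": {"R6"}}   # Add R5,R4,R6
print(classify_hazards(mul, add))
```

For the Mul/Add pair used earlier, only the flow dependence through R4 is reported.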
7. Discuss instruction hazards.
[Jan 2012 & 2013]
Pipelined execution of instructions reduces execution time and improves performance,
provided the pipeline is supplied with a steady stream of instructions. Whenever this stream
is interrupted, the pipeline stalls, as in the case of a cache miss. A branch instruction may
also cause the pipeline to stall. The effect of branch instructions and the techniques that can
be used for mitigating their impact are discussed for unconditional branches and conditional
branches.
Unconditional branches:
Consider a sequence of instructions being executed in a two-stage pipeline. Instructions I1
to I3 are stored at successive memory addresses, and I2 is a branch instruction. Let the
branch target be instruction Ik. In clock cycle 3, the fetch operation for instruction I3 is in
progress at the same time that the branch instruction is being decoded and the target address
computed. In clock cycle 4, the processor must discard I3, which has been incorrectly
fetched, and fetch instruction Ik. In the meantime, the hardware unit responsible for the
Execute (E) step must be told to do nothing during that clock period.
Either a cache miss or a branch instruction stalls the pipeline for one or more clock cycles.
To reduce the effect of these interruptions, many processors employ sophisticated fetch
units that can fetch instructions before they are needed and put them in a queue. Typically,
the instruction queue can store several instructions. A separate unit, which we call the
dispatch unit, takes instructions from the front of the queue and sends them to the
execution unit. This leads to an organization in which the dispatch unit also performs the
decoding function.
To be effective, the fetch unit must have sufficient decoding and processing capability to
recognize and execute branch instructions. It attempts to keep the instruction queue filled
at all times to reduce the impact of occasional delays when fetching instructions. If there is
a delay in fetching instructions because of a branch or a cache miss, the dispatch unit
continues to issue instructions from the instruction queue. The fetch unit continues to fetch
instructions and add them to the queue.
Consider how the queue length changes and how it affects the relationship between different
pipeline stages. Suppose that instruction I1 introduces a 2-cycle stall. Since space is
available in the queue, the fetch unit continues to fetch instructions and the queue length
rises to 3 in clock cycle 6. Instruction I5 is a branch instruction. Instructions I1, I2, I3, I4
and Ik complete execution in successive clock cycles. Hence, the branch instruction does not
increase the overall execution time. This technique is referred to as branch folding.
Reading more than one instruction in each clock cycle may reduce delay. Having an
instruction queue like this is also beneficial in dealing with cache misses. The instruction
queue mitigates the impact of branch instructions on performance through the process of
branch folding. It has a similar effect on stalls caused by cache misses. The effectiveness of
this technique is enhanced when the instruction fetch unit is able to read more than one
instruction at a time from the instruction cache.
Buffer registers have been introduced at the inputs and output of the ALU. These are
registers SRC1, SRC2, and RSLT. Forwarding connections may be added if desired. The
instruction register has been replaced with an instruction queue, which is loaded from the
instruction cache. The output of the instruction decoder is connected to the control signal
pipeline. This pipeline holds the control signals in buffers B2 and B3.
The following operations can be performed independently in the processor,
Reading an instruction from the instruction cache
Incrementing the PC
Decoding an instruction
Reading from or writing into the data cache
Reading the contents of up to two registers from the register file
Writing into one register in the register file
Performing an ALU operation
In a superscalar processor, the detrimental effect on performance of various hazards
becomes even more pronounced. The compiler can avoid many hazards through judicious
selection and ordering of instructions. For example, the compiler should strive to interleave
floating-point and integer instructions.
This would enable the dispatch unit to keep both the integer and floating-point units busy
most of the time. In general, high performance is achieved if the compiler is able to arrange
program instructions to take maximum advantage of the available hardware units.
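The benefit of interleaving can be illustrated with a rough sketch. It assumes a hypothetical machine with one integer unit and one floating-point unit that dispatches in program order, at most one instruction of each kind per cycle, and it ignores data dependencies; the function name dispatch_cycles is made up for this example.

```python
def dispatch_cycles(kinds):
    """Count cycles needed to dispatch instructions in program order when
    each cycle can start at most one 'int' and one 'fp' instruction
    (assumed two-unit machine, dependencies ignored)."""
    cycles = 0
    i = 0
    while i < len(kinds):
        cycles += 1
        busy = set()  # execution units already claimed in this cycle
        # Keep dispatching in order until the next instruction needs a busy unit.
        while i < len(kinds) and kinds[i] not in busy:
            busy.add(kinds[i])
            i += 1
    return cycles


grouped = ["int"] * 4 + ["fp"] * 4   # all integer work first, then floating-point
interleaved = ["int", "fp"] * 4      # the ordering a compiler should aim for

grouped_cycles = dispatch_cycles(grouped)          # 7
interleaved_cycles = dispatch_cycles(interleaved)  # 4
```

Under these assumptions the same eight instructions dispatch in 4 cycles when interleaved and 7 cycles when grouped by type.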
Out-of-order execution:
Instructions are dispatched in the same order as they appear in the program. However, their
execution is completed out of order. One issue that arises from this concerns dependencies
among instructions.
To guarantee a consistent state when exceptions occur, the results of the execution of
instructions must be written into the destination locations strictly in program order. This
means we must delay step W2 until cycle 6. In turn, the integer execution unit must retain
the result of instruction I2, and hence it cannot accept instruction I4 until cycle 6. If an
exception occurs during an instruction, all subsequent instructions that may have been
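partially executed are discarded. This is called a precise exception. It is easier to provide
precise exceptions in the case of external interrupts. When an external interrupt is
received, the dispatch unit stops issuing new instructions and the instructions already in
execution are allowed to complete. At this point, the processor and all its
registers are in a consistent state, and interrupt processing can begin.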
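The rule that results are written strictly in program order can be sketched as follows. The function name in_order_write_cycles and the example cycle numbers are illustrative assumptions, and the model assumes that any number of results may be written in the same cycle.

```python
def in_order_write_cycles(ready_cycles):
    """Given the cycle in which each instruction's result is ready (listed in
    program order), return the cycle in which each result may be written so
    that no instruction writes before an earlier one (assumed model)."""
    write_cycles = []
    last_write = 0
    for ready in ready_cycles:
        # A result waits until it is ready AND every earlier result is written.
        last_write = max(ready, last_write)
        write_cycles.append(last_write)
    return write_cycles


# Hypothetical timing: I1 is a long floating-point operation ready in cycle 6,
# I2 is a short integer operation ready in cycle 4.
writes = in_order_write_cycles([6, 4, 7, 5])  # [6, 6, 7, 7]
```

Here the second instruction finishes early but its write step is held back until cycle 6, so the execution unit must retain its result in the meantime.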
[Jan 2012]
[Dec 2011]
A branch instruction loads the processor's program counter with a new non-sequential
value. Consequently, all the instructions whose execution was started before the branch was
taken are suddenly redundant and the pipeline has to be refilled with instructions following
the branch target address. The cost of executing an operation that causes a non-sequential
flow of control is known as the branch penalty.
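The effect of the branch penalty on performance can be estimated with a simple calculation. This is a sketch under stated assumptions: an ideal pipeline that otherwise completes one instruction per cycle, a penalty paid only by taken branches, and example figures (20% branches, 60% taken, 2-cycle penalty) chosen purely for illustration.

```python
def effective_cpi(branch_fraction, taken_fraction, penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction when each taken branch costs
    `penalty_cycles` extra cycles (assumed simple model)."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


# Illustrative numbers only: 20% of instructions are branches,
# 60% of those are taken, and a taken branch wastes 2 cycles.
cpi = effective_cpi(0.20, 0.60, 2)  # approximately 1.24
```

With these figures the pipeline runs about 24% slower than the ideal of one instruction per cycle, which is why reducing the penalty matters.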
This section is concerned with reducing or even eliminating the bubble in a RISC
pipeline caused when a branch is taken; that is, with ways of reducing the
branch penalty. Some of the techniques involve limiting the damage done by a branch and
some techniques attempt to predict the outcome of a branch before it has been executed.
Several instructions modify the flow of control; for example, the unconditional branch, the
conditional branch, the subroutine call, and the subroutine return. Internally generated
traps and exceptions and externally generated interrupts also modify the flow of control.
Subroutine calls and returns are not normally regarded as branch operations from the
computer architect's point of view, but they have similar characteristics from the computer
designer's point of view; that is, they also incur a branch penalty. The unconditional branch
is always taken and forces execution to continue at the target address. An unconditional
branch is equivalent to the high-level language goto and its outcome is known at compile time.
Reduce branch penalty:
The outcome of a conditional branch is determined by the state of one or more flag bits in
the processor's condition code register and is therefore not known until runtime. The
conditional branch may or may not be taken. When a branch is not taken, the outcome is sometimes
called in-line because the next instruction immediately following the branch is executed. A
subroutine call is a type of unconditional branch that saves the return address. Similarly, a
subroutine return is an unconditional branch that fetches the target address from a register
or the stack. Some computers support conditional subroutine calls and returns.
1. Predict branch/jump instructions AND branch direction (taken or not taken)
2. Predict branch/jump target address (for taken branches)
3. Speculatively execute instructions along the predicted path
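Step 1 above (predicting the branch direction) can be sketched with a 2-bit saturating counter per branch address. This particular scheme is not described in these notes; it is introduced here only as a common textbook example, and the class name TwoBitPredictor, the initial counter value, and the branch address 0x40 are assumptions.

```python
class TwoBitPredictor:
    """Direction predictor with one 2-bit saturating counter per branch
    address: states 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter state (0..3)

    def predict(self, pc):
        # Unseen branches start at 1 (weakly not taken); an assumed default.
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        state = self.counters.get(pc, 1)
        # Move toward 3 on a taken branch, toward 0 otherwise, saturating.
        self.counters[pc] = min(state + 1, 3) if taken else max(state - 1, 0)


def count_mispredictions(outcomes, pc=0x40):
    """Run the predictor over actual outcomes of one branch; count misses."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch taken 9 times and then not taken, executed twice.
loop_outcomes = ([True] * 9 + [False]) * 2
misses = count_mispredictions(loop_outcomes)  # 3 of 20 branches mispredicted
```

Because the counter needs two wrong outcomes in a row to change its prediction, the single not-taken exit of the loop does not spoil the prediction for the next pass through the loop.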
Anna University Questions
Part- A
1. [Ref. No.: 1]
2. [Ref. No.: 4]
3. [Ref. No.: 15]
4. [Ref. No.: 48]
5. [Ref. No.: 49]
6. [Ref. No.: 56]
7. [Ref. No.: 7]
8. [Ref. No.: 57]
Part B
1. Write about general CPU organization with an example. (or) Explain the fundamental
concepts.
[Ref. No.: 1]
2. List and explain the steps involved in the execution of a complete instruction.
[Ref. No.: 2]
3. Explain hardwired control with a block diagram, microprogrammed control, and
microinstructions.
[Ref. No.: 4]
4. Explain in detail the implementation of a pipeline with a neat diagram.
[Ref. No.: 5]
5. What is a data hazard? How will you overcome it?
[Ref. No.: 6]
6. Discuss instruction hazards.
[Ref. No.: 7]
7. Differentiate between microprogrammed and hardwired control.
[Ref. No.: 10]
8. What is branch penalty? Explain how branch penalty is reduced.