
Lecture 03


Execution time

• Execution Time (processor-related)
  ET = IC x CPI x T
  where
  IC = instruction count
  CPI = average number of system clock periods to execute an instruction
  T = clock period
Example
Consider two SRC programs having three types of instructions, given as follows:

Number of ..                  Program 1   Program 2
data transfer instructions        2           1
control instructions              2           5
ALSU instructions                 2           1

Compare the two programs on the following parameters:

1. Instruction count
2. Speed of execution
Example contd..
1. Instruction count (IC)
   IC for program 1 = 2 + 2 + 2 = 6
   IC for program 2 = 1 + 5 + 1 = 7
2. For execution time we can use the following SRC specifications.

   Instruction Type   CPI
   Control             2
   ALSU                3
   Data Transfer       4

   ET = IC x CPI x T
   ET1 = (2x2) + (2x3) + (2x4) = 18 clock periods
   ET2 = (5x2) + (1x3) + (1x4) = 17 clock periods

Note: Since both programs are executing on the same machine, the T factor can be ignored while calculating ET.
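The weighted sum in this example can be checked with a short script. This is a minimal sketch: the function name `cycles` and the dictionary layout are illustrative choices, not part of the lecture, while the CPI values and instruction counts are the ones given in the example.

```python
# CPI per instruction type, taken from the SRC specifications in the example.
CPI = {"control": 2, "alsu": 3, "data_transfer": 4}

def cycles(counts, cpi=CPI):
    """Total clock periods: sum over instruction types of (count x CPI).

    Multiplying the result by the clock period T gives the execution time.
    """
    return sum(n * cpi[kind] for kind, n in counts.items())

# Instruction counts per type for the two programs in the example.
program1 = {"data_transfer": 2, "control": 2, "alsu": 2}
program2 = {"data_transfer": 1, "control": 5, "alsu": 1}

for name, prog in (("Program 1", program1), ("Program 2", program2)):
    ic = sum(prog.values())
    print(f"{name}: IC = {ic}, ET = {cycles(prog)}T")
```

Under these assumptions program 2 has the larger instruction count but needs one clock period less, because most of its instructions are of the cheapest (control) type.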
Problem: Consider the following two SRC code segments for implementing the operation a=b+5c. Find which one is more efficient in terms of instruction count and execution time.

Program 1: Multiplication by using repeated addition in a for loop

      .org 0
a:    .dw 1
b:    .dw 1
c:    .dw 1
      .org 80
      la r5, 5        ; load value of loop index
      lar r6, mpy     ; load address of mpy
      lar r7, next    ; load address of next
      ld r2, b        ; load contents of b
      ld r3, c        ; load contents of c
      la r4, 0        ; load 0 in r4
mpy:
      brzr r7, r5     ; jump to next after 5 iterations
      add r4, r4, r3  ; r4 contains r4+c
      addi r5, r5, -1 ; decrement index
      br r6           ; loop again
next:
      add r4, r4, r2  ; r4 contains sum of b and 5c
      st r4, a        ; store at address a
      stop
Program 2: Multiplication using subroutine call

      .org 0
a:    .dw 1
b:    .dw 1
c:    .dw 1
      .org 80
      lar r1, mpy     ; load address of mpy in r1
      ld r2, b        ; load contents of b in r2
      la r3, 5        ; load index in r3
      ld r4, c        ; load contents of c in r4
      brl r5, r1      ; r5 contains PC
      add r2, r2, r7  ; r2 contains sum b+5c
      st r2, a
      stop
mpy:
      la r7, 0        ; r7 contains zero
      lar r8, again   ; r8 contains address of again
again:
      brzr r5, r3     ; exit loop when index is 0
      add r7, r7, r4  ; r7 contains r7+c
      addi r3, r3, -1 ; decrement index
      br r8
Solution
The instructions in both programs can be divided into 3 types, and the respective count of each type is

Number of..                   Program 1   Program 2
Data transfer instructions        7           7
Control instructions              3           4
ALSU instructions                 3           3

IC for program 1 = 7 + 3 + 3 = 13
IC for program 2 = 7 + 4 + 3 = 14
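The counts in this solution can be reproduced by classifying each mnemonic of the two listings. The sketch below is illustrative only: the grouping of mnemonics into three sets (in particular, treating `stop` as a control instruction) is an assumption inferred from the counts given in the solution, and the function name `count_by_type` is hypothetical. The counts are static, meaning each instruction is counted once as written, regardless of how many times the loop body executes.

```python
# Assumed grouping of the SRC mnemonics used in the two programs.
DATA_TRANSFER = {"ld", "ldr", "la", "lar", "st", "str"}
CONTROL = {"br", "brl", "brzr", "stop"}   # assumption: stop counted as control
ALSU = {"add", "addi"}

def count_by_type(mnemonics):
    """Return the static instruction count per instruction type."""
    counts = {"data_transfer": 0, "control": 0, "alsu": 0}
    for m in mnemonics:
        if m in DATA_TRANSFER:
            counts["data_transfer"] += 1
        elif m in CONTROL:
            counts["control"] += 1
        elif m in ALSU:
            counts["alsu"] += 1
        else:
            raise ValueError(f"unclassified mnemonic: {m}")
    return counts

# Mnemonics of the two listings, in program order (assembler directives omitted).
program1 = "la lar lar ld ld la brzr add addi br add st stop".split()
program2 = "lar ld la ld brl add st stop la lar brzr add addi br".split()

for name, prog in (("Program 1", program1), ("Program 2", program2)):
    print(name, count_by_type(prog), "IC =", len(prog))
```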
Solution contd..
For execution time, consider the following SRC specifications.

Instruction Type   CPI
Control             2
ALSU                3
Data Transfer       4

ET = IC x CPI x T
ET1 = (7x4) + (3x2) + (3x3) = 43T
ET2 = (7x4) + (4x2) + (3x3) = 45T

Conclusion:
Program 1 runs faster than program 2, as is evident from the execution times of both.
MIPS
• Millions of Instructions Per Second
  MIPS = IC / (ET x 10^6)
• Capability of different instructions varies from machine to machine, e.g. RISC machines have simpler instructions, so the same job will require more instructions
• Was popular when the VAX 11/780 was treated as a reference, in the late 70s and early 80s
MIPS as a performance metric
• MIPS is inversely proportional to execution time:
  ET = IC / (MIPS x 10^6)
Example
Consider a machine having a 100 MHz clock and three instruction types with the following parameters.

Instruction Type   CPI
Control             2
ALSU                3
Data Transfer       4

Now suppose that two different compilers generate code for the same program. The instruction count for each is given as follows.

IC in millions   Code from compiler 1   Code from compiler 2
Control                   5                      10
ALSU                      1                       1
Data Transfer             1                       1

Compare the two codes according to MIPS and according to execution time.
Solution:
First we find the CPI for both code sequences.
Since CPI = (sum of the clock cycles for each type of instruction) / IC
CPI1 = (5x2 + 1x3 + 1x4) / 7 = 2.43
CPI2 = (10x2 + 1x3 + 1x4) / 12 = 2.25

As MIPS = Clock Rate / (CPI x 10^6)
MIPS1 = (100 x 10^6) / (2.43 x 10^6) = 41.15
MIPS2 = (100 x 10^6) / (2.25 x 10^6) = 44.44
Hence the code generated by compiler 2 has the higher MIPS rating.
Note on the MIPS formula used above:
As MIPS = IC / (ET x 10^6) and ET = IC x CPI / clock rate,
MIPS = (IC x clock rate) / (IC x CPI x 10^6)
     = Clock rate / (CPI x 10^6)
Solution contd..
Since ET = IC / (MIPS x 10^6)
ET1 = (7 x 10^6) / (41.15 x 10^6) = 0.17 seconds
ET2 = (12 x 10^6) / (44.44 x 10^6) = 0.27 seconds
Hence code sequence 1 is more efficient in terms of execution time, even though code sequence 2 has the higher MIPS rating.
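The whole compiler comparison can be recomputed in a few lines. This is a sketch under the example's stated parameters (100 MHz clock, the three CPI values, instruction counts in millions); the function name `analyse` and the variable names are illustrative and not part of the lecture.

```python
CLOCK_RATE_HZ = 100_000_000  # 100 MHz clock from the example
CPI = {"control": 2, "alsu": 3, "data_transfer": 4}

def analyse(counts_in_millions):
    """Return (average CPI, MIPS rating, execution time in seconds)."""
    ic = sum(counts_in_millions.values()) * 1_000_000
    total_cycles = sum(n * 1_000_000 * CPI[kind]
                       for kind, n in counts_in_millions.items())
    cpi = total_cycles / ic                 # average CPI of this code sequence
    mips = CLOCK_RATE_HZ / (cpi * 1e6)      # MIPS = clock rate / (CPI x 10^6)
    et = ic / (mips * 1e6)                  # ET = IC / (MIPS x 10^6)
    return cpi, mips, et

compiler1 = {"control": 5, "alsu": 1, "data_transfer": 1}
compiler2 = {"control": 10, "alsu": 1, "data_transfer": 1}

for name, counts in (("compiler 1", compiler1), ("compiler 2", compiler2)):
    cpi, mips, et = analyse(counts)
    print(f"{name}: CPI = {cpi:.2f}, MIPS = {mips:.2f}, ET = {et:.2f} s")
```

Because the script does not round the CPI before dividing, it reports a MIPS rating of about 41.18 for compiler 1 rather than the 41.15 obtained from the rounded value 2.43. The ranking is unchanged: compiler 2 has the higher MIPS rating, while compiler 1 has the shorter execution time.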
MFLOPS
• Millions of FLoating point Operations Per Second
• Counting only FP operations is considered by some to be more meaningful than counting arbitrary instructions
• Results vary from FP op to FP op
• Better compared to MIPS because of two reasons:
  1. FP ops are complex, and therefore provide a better picture of the capabilities of the hardware on which they are run
  2. Overheads (get operands, store results, etc.) are effectively lumped with the FP ops they support
Dhrystones ***
• Dhrystone is a general “integer performance”
benchmark test originally developed by Reinhold
Weicker in 1984.
• Small program; less than 100 HLL statements
• Compiles to about 1 to 1.5 KB of code

*** The name is a play on the word Whetstone


Disadvantages of using Whetstones and Dhrystones
Both Whetstones and Dhrystones are now considered obsolete for the following reasons.
• Small, fit in cache
• Obsolete instruction mix
• Prone to compiler tricks
• Difficult to reproduce results
• Uncontrolled source code
SPEC
• System Performance Evaluation Cooperative (SPEC) was founded in October 1988 by Apollo, Hewlett-Packard, MIPS Computer Systems and Sun Microsystems
• Latest version is SPEC CPU2000
SPEC
• The standard SPEC benchmark suite includes:
  - A compiler
  - A Boolean minimization program
  - A spreadsheet program
  - A number of other programs that stress arithmetic processing speed
• It uses a simple metric, elapsed time, to measure performance of competing machines
• Machine independent code is used for fair comparison
Advantages
• It provides for ease of publication.
• Each benchmark carries the same weight.
• SPECratio is dimensionless.
• It is not unduly influenced by long running
programs.
• It is relatively immune to performance variation
on individual benchmarks.
• It provides a consistent and fair metric.
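Programmer’s view of the SRC
[Figure: programmer's view of the SRC. CPU: a register file of 32 registers, R0 to R31, each 32 bits wide (bits 31..0), plus the IR and PC. Main memory: byte-wide locations (bits 7..0) at addresses 0 to 2^32 - 1.]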

SRC: Notation
• R[3] means contents of register 3
• M[8] means contents of memory location 8
• A memory word at address 8 is defined as the
32 bits at address 8,9,10 and 11
SRC: Notation
(continued…)

• Special notation for 32-bit memory words


M[8]<31…0>:=M[8]©M[9]©M[10]©M[11]
© is used to represent concatenation
• Logical addresses

  Address   Byte (bits 7..0)
  a         M[8]
  a+1       M[9]
  a+2       M[10]
  a+3       M[11]

  One memory “word”:

  Bits    31..24      23..16   15..8    7..0
  Byte    M[8]        M[9]     M[10]    M[11]
          (MS Byte)                     (LS Byte)
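The word notation can be illustrated by assembling a 32-bit word from four consecutive bytes, with the byte at the lowest address as the most significant byte. This is a minimal sketch: the helper name `read_word`, the 16-byte memory and the byte values are made up for illustration.

```python
def read_word(memory, addr):
    """Return the 32-bit word formed by M[addr]..M[addr+3], MS byte first."""
    word = 0
    for offset in range(4):
        # Shift the bytes collected so far up by 8 bits and append the next one.
        word = (word << 8) | memory[addr + offset]
    return word

memory = bytearray(16)                            # small byte-addressable memory
memory[8:12] = bytes([0x12, 0x34, 0x56, 0x78])    # M[8], M[9], M[10], M[11]

print(hex(read_word(memory, 8)))                  # 0x12345678
```

The same result is obtained with Python's built-in `int.from_bytes(memory[8:12], "big")`, since this layout is the big-endian byte order.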
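SRC: instruction formats

Type A   bits 31..27: Op-code | bits 26..0: unused
Type B   bits 31..27: Op-code | bits 26..22: ra | bits 21..0: c1
Type C   bits 31..27: Op-code | bits 26..22: ra | bits 21..17: rb | bits 16..0: c2
Type D   bits 31..27: Op-code | bits 26..22: ra | bits 21..17: rb | bits 16..12: rc | bits 11..0: c3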
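The field boundaries of the four formats can be expressed as shifts and masks on a 32-bit instruction word. The sketch below is illustrative: the function names `decode` and `encode_type_c` are hypothetical, and the constant fields are returned as raw unsigned bit patterns (any sign extension is left out).

```python
def decode(instr):
    """Extract every field position used by the Type A to Type D formats.

    Which of the fields are meaningful depends on the instruction type.
    """
    return {
        "op": (instr >> 27) & 0x1F,   # bits 31..27
        "ra": (instr >> 22) & 0x1F,   # bits 26..22
        "rb": (instr >> 17) & 0x1F,   # bits 21..17
        "rc": (instr >> 12) & 0x1F,   # bits 16..12
        "c1": instr & 0x3FFFFF,       # bits 21..0 (Type B)
        "c2": instr & 0x1FFFF,        # bits 16..0 (Type C)
        "c3": instr & 0xFFF,          # bits 11..0 (Type D)
    }

def encode_type_c(op, ra, rb, c2):
    """Pack a Type C instruction word from its four fields."""
    return (op << 27) | (ra << 22) | (rb << 17) | (c2 & 0x1FFFF)

# Type C word with op-code 1, ra = 3, rb = 5, c2 = 56.
word = encode_type_c(1, 3, 5, 56)
fields = decode(word)
print(fields["op"], fields["ra"], fields["rb"], fields["c2"])
```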
Type A   bits 31..27: Op-code | bits 26..0: unused

Only two instructions:
• nop (op-code = 0)
  - useful in pipelining
• stop (op-code = 31)
Both are 0-operand
Type B   bits 31..27: Op-code | bits 26..22: ra | bits 21..0: c1

Note: R8 is a register name and R[8] means the contents of register R8

• Three instructions; all three use relative addressing mode
  - ldr (op-code = 2): load register from memory using relative address
      ldr R3, 56      R[3] ← M[PC+56]
  - lar (op-code = 6): load register with relative address
      lar R3, 56      R[3] ← PC+56
  - str (op-code = 4): store register to memory using relative address
      str R8, 34      M[PC+34] ← R[8]
• The effective address is computed at run-time by adding a constant to the PC
• This makes the instructions relocatable
Type C   bits 31..27: Op-code | bits 26..22: ra | bits 21..17: rb | bits 16..0: c2

• Three load/store instructions, plus three ALU instructions
  - ld (op-code = 1): load register from memory
      ld R3, 56         R[3] ← M[56]          (rb field = 0)
      ld R3, 56(R5)     R[3] ← M[56+R[5]]     (rb field ≠ 0)
  - la (op-code = 5): load register with displacement address
      la R3, 56         R[3] ← 56
      la R3, 56(R5)     R[3] ← 56+R[5]
  - st (op-code = 3): store register to memory
      st R8, 34         M[34] ← R[8]
      st R8, 34(R6)     M[34+R[6]] ← R[8]
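The displacement addressing rule for ld, la and st can be sketched as a tiny interpreter. This is a simplified model under stated assumptions: memory is a dictionary mapping an address to a whole 32-bit word (the byte layout is ignored), the constant c2 is treated as a non-negative number (no sign extension), and the Python function names simply mirror the mnemonics.

```python
regs = [0] * 32     # R[0]..R[31]
memory = {}         # address -> 32-bit word (simplification: no byte layout)

def effective_address(c2, rb=0):
    """Displacement address: c2 if the rb field is 0, otherwise c2 + R[rb]."""
    if rb == 0:
        return c2
    return (c2 + regs[rb]) & 0xFFFFFFFF

def ld(ra, c2, rb=0):
    regs[ra] = memory.get(effective_address(c2, rb), 0)   # R[ra] <- M[ea]

def la(ra, c2, rb=0):
    regs[ra] = effective_address(c2, rb)                  # R[ra] <- ea

def st(ra, c2, rb=0):
    memory[effective_address(c2, rb)] = regs[ra]          # M[ea] <- R[ra]

regs[5] = 100
regs[8] = 0xABCD

la(3, 56)           # la R3, 56      : R[3] <- 56
print(regs[3])
la(3, 56, 5)        # la R3, 56(R5)  : R[3] <- 56 + R[5]
print(regs[3])
st(8, 34)           # st R8, 34      : M[34] <- R[8]
ld(4, 34)           # ld R4, 34      : R[4] <- M[34]
print(hex(regs[4]))
```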
Problem: Consider the following two SRC code segments for implementing multiplication. Find which one is more efficient in terms of instruction count and execution time.

Program 1: Multiplication by using repeated addition in a for loop

      la r5, 5        ; load value of loop index
      lar r6, mpy     ; load address of mpy
      lar r7, next    ; load address of next
      ld r2, b        ; load contents of b in r2
      ld r3, c        ; load contents of c in r3
      la r4, 0        ; load 0 in r4
mpy:
      brzr r7, r5     ; jump to next after 5 iterations
      add r4, r4, r3  ; r4 contains r4+c
      addi r5, r5, -1 ; decrement index
      br r6           ; loop again
next:
      add r4, r4, r2  ; r4 contains sum of b and 5c
      st r4, a        ; store at address label a

Program 2: Multiplication using subroutine call

      lar r1, mpy     ; load address of mpy in r1
      ld r2, b        ; load contents of b in r2
      la r3, 5        ; load index in r3
      ld r4, c        ; load contents of c in r4
      brl r5, r1      ; r5 contains PC
      add r2, r2, r7  ; r2 contains sum of b and 5c
      st r2, a
mpy:
      lar r8, again   ; r8 contains address of again
again:
      brzr r5, r3     ; exit loop when index is 0
      add r7, r7, r4  ; r7 contains r7+c
      addi r3, r3, -1 ; decrement index
      br r8
Solution
The instructions in both programs can be divided into 3 types, and the respective count of each type is

Number of..                   Program 1   Program 2
Data transfer instructions        7           6
Control instructions              2           3
ALSU instructions                 3           3

IC for program 1 = 7 + 2 + 3 = 12
IC for program 2 = 6 + 3 + 3 = 12
Solution contd..
For execution time, consider the following SRC specifications.

Instruction Type   CPI
Control             2
ALSU                3
Data Transfer       4

ET = IC x CPI x T
ET1 = (7x4) + (2x2) + (3x3) = 41
ET2 = (6x4) + (3x2) + (3x3) = 39

Conclusion:
Although the instruction count for both programs is the same, program 2 runs faster than program 1 (39 versus 41 clock cycles) because it requires fewer clock cycles.
