15IF11: Multicore Technology @ PSG Tech, Coimbatore
Session-5
Dr. John Jose

Assistant Professor
Department of Computer Science & Engineering
Indian Institute of Technology Guwahati, Assam.
9th & 10th March 2019
Problem-1: Amdahl’s Law
A new floating-point unit makes floating-point operations twice as fast. In
an application, one fifth of the instructions are floating-point operations.
(a) What is the overall speedup? (Ignore any penalty to other instructions.)
(b) Assume that speeding up the floating-point unit slows down data cache
accesses by a factor of 1.5. If load instructions constitute 15% and store
instructions constitute 9% of the total instructions, what is the effective
overall speedup now?
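(a) S = 1 / { (1 - f) + (f/N) } = 1 / { (1 - 0.2) + (0.2/2) } = 1.11 times
(b) Loads and stores together are f2 = 0.15 + 0.09 = 0.24 of the instructions,
    and a 1.5x slowdown is a speedup factor of N2 = 1/1.5 = 0.67.
    S = 1 / { (1 - f1 - f2) + (f1/N1) + (f2/N2) }
      = 1 / { (1 - 0.2 - 0.24) + (0.2/2) + (0.24/0.67) }
      = 0.98 times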
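
The two results above can be checked numerically. The following is a minimal sketch, not part of the original slides; the helper name `amdahl_speedup` and its list-of-pairs interface are assumptions made for illustration. A factor below 1 models a slowdown.

```python
def amdahl_speedup(parts):
    """Generalised Amdahl's law.

    parts: list of (fraction, factor) pairs. A factor > 1 speeds that
    fraction up, a factor < 1 slows it down. Any fraction not listed
    is assumed to run at its original speed.
    """
    untouched = 1.0 - sum(f for f, _ in parts)
    new_time = untouched + sum(f / n for f, n in parts)
    return 1.0 / new_time

# (a) 20% floating-point instructions, twice as fast
speedup_a = amdahl_speedup([(0.20, 2.0)])
# (b) additionally, 24% loads/stores slowed down by 1.5x
speedup_b = amdahl_speedup([(0.20, 2.0), (0.24, 1 / 1.5)])

print(f"(a) {speedup_a:.2f}x")  # 1.11x
print(f"(b) {speedup_b:.2f}x")  # 0.98x
```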
Problem-2: Basic Performance Analysis
Consider two programs A and B that solve a given problem. A is scheduled
to run on a processor P1 operating at 1 GHz and B is scheduled to run on a
processor P2 running at 1.4 GHz. A has a total of 10000 instructions, of
which 20% are branch instructions, 40% are load-store instructions and the
rest are ALU instructions. B is composed of 25% branch instructions. The
number of load-store instructions in B is twice the count of ALU
instructions. The total instruction count of B is 12000. In both P1 and P2,
branch instructions have an average CPI of 5 and ALU instructions have an
average CPI of 1.5. The two architectures differ in the CPI of load-store
instructions: 2 for P1 and 3 for P2. Which mapping (A on P1 or B on P2)
solves the problem faster, and by how much?
                          A on P1          B on P2
Clock (cycle time)        1 GHz (1 ns)     1.4 GHz (0.714 ns)
Instruction count (IC)    10000            12000
Mix BR : L/S : ALU (%)    20 : 40 : 40     25 : 50 : 25
CPI BR : L/S : ALU        5 : 2 : 1.5      5 : 3 : 1.5
(a) CPI of A on P1 = (0.2 x 5 + 0.4 x 2 + 0.4 x 1.5) = 2.4
    ExT = 2.4 x 10000 x 1 ns = 24000 ns
(b) CPI of B on P2 = (0.25 x 5 + 0.5 x 3 + 0.25 x 1.5) = 3.125
    ExT = 3.125 x 12000 x 0.714 ns = 26775 ns
Hence A on P1 is faster, by 26775 - 24000 = 2775 ns (about 1.12 times).
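
The same calculation can be sketched in a few lines of code. This is an illustrative check only; the function name `exec_time_ns` and the dictionary keys are assumptions. Note that it uses the exact cycle time 1/1.4 ns rather than the rounded 0.714 ns, so the time for B comes out near 26786 ns instead of 26775 ns.

```python
def exec_time_ns(instr_count, mix, cpi, clock_ghz):
    """Return (average CPI, execution time in ns).

    mix and cpi are dicts keyed by instruction class; mix holds
    fractions that sum to 1.
    """
    avg_cpi = sum(mix[k] * cpi[k] for k in mix)
    cycle_ns = 1.0 / clock_ghz          # 1 GHz -> 1 ns per cycle
    return avg_cpi, avg_cpi * instr_count * cycle_ns

cpi_a, t_a = exec_time_ns(
    10_000,
    {"br": 0.20, "ls": 0.40, "alu": 0.40},
    {"br": 5, "ls": 2, "alu": 1.5},
    clock_ghz=1.0,
)
cpi_b, t_b = exec_time_ns(
    12_000,
    {"br": 0.25, "ls": 0.50, "alu": 0.25},
    {"br": 5, "ls": 3, "alu": 1.5},
    clock_ghz=1.4,
)

print(f"A on P1: CPI={cpi_a:.3f}, time={t_a:.0f} ns")
print(f"B on P2: CPI={cpi_b:.3f}, time={t_b:.0f} ns")
print(f"A on P1 is faster by a factor of {t_b / t_a:.2f}")
```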
Problem-3: Pipeline Hazards
A program has 2000 instructions in the sequence L.D, ADD.D, L.D, ADD.D, ..., L.D,
ADD.D. Each ADD.D instruction depends on the L.D instruction right before it. Each L.D
instruction depends on the ADD.D instruction right before it. If the program is
executed on a 5-stage pipeline, what is the actual CPI with and without
operand forwarding?
Without operand forwarding.
The ID stage of the nth instruction can start only after the WB stage of the
(n-1)th instruction, so every instruction after the first incurs 3 stalls.
Cycle   1   2   3   4   5   6   7   8   9   10  11  12  13  14
L.D     IF  ID  EX  ME  WB
ADD.D       IF  *   *   *   ID  EX  ME  WB
L.D                         IF  *   *   *   ID  EX  ME  WB
ADD.D                                       IF  *   *   *   ID
Instructions reach WB at clock cycles 5, 9, 13, 17, 21, 25, 29, ...

The last instruction (ADD.D) reaches WB in cycle 5 + (1999 x 4) = 8001.
CPI = 8001/2000 = 4.0005 (approximately 4)
With operand forwarding.
Every ADD.D after an L.D has one stall, but an L.D after an ADD.D does not stall.
Cycle   1   2   3   4   5   6   7   8   9   10  11  12  13  14
L.D     IF  ID  EX  ME  WB
ADD.D       IF  *   ID  EX  ME  WB
L.D                 IF  ID  EX  ME  WB
ADD.D                   IF  *   ID  EX  ME  WB
Instructions reach WB at clock cycles 5, 7, 8, 10, 11, 13, 14, 16, ...

The last instruction (ADD.D) reaches WB in cycle 7 + (999 x 3) = 3004.
CPI = 3004/2000 = 1.502
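
Both cases can be reproduced with a small model of when each instruction reaches WB. This is a simplified sketch rather than a full pipeline simulator: it assumes each instruction completes one cycle after its predecessor plus the stalls it incurs, and the function name `wb_cycles` and its parameters are hypothetical.

```python
def wb_cycles(n_instr, stalls_after_load, stalls_after_add, depth=5):
    """Return the cycle in which each instruction reaches WB.

    Instructions alternate L.D, ADD.D, L.D, ADD.D, ...
    stalls_after_load: stalls an ADD.D incurs behind the preceding L.D
    stalls_after_add:  stalls an L.D incurs behind the preceding ADD.D
    """
    cycles = [depth]                      # first L.D finishes WB in cycle 5
    for i in range(1, n_instr):
        is_add = (i % 2 == 1)             # odd positions are ADD.D
        stalls = stalls_after_load if is_add else stalls_after_add
        cycles.append(cycles[-1] + 1 + stalls)
    return cycles

no_fwd = wb_cycles(2000, stalls_after_load=3, stalls_after_add=3)
fwd = wb_cycles(2000, stalls_after_load=1, stalls_after_add=0)

print("without forwarding:", no_fwd[:4], "last =", no_fwd[-1],
      "CPI =", no_fwd[-1] / 2000)
print("with forwarding:   ", fwd[:8], "last =", fwd[-1],
      "CPI =", fwd[-1] / 2000)
```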
Problem-4: Index and Offset Calculations
A cache has 512 KB capacity, 4 B words, 64 B blocks and is 8-way
set associative. The system uses 32-bit addresses. For each of
the following addresses, which set of the cache will be searched,
and which word of the selected cache block will be forwarded on
a cache hit? (a) 0xABC89984 (b) 0x485669AC
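# sets = CS/(BS x A) = 2^19/(2^6 x 2^3) = 2^10 = 1024 sets
1 word = 4 B, hence a 64-byte block has 16 words
Tag = 16 bits, Index = 10 bits, Offset = 6 bits (4 word-select + 2 byte-select)
0xABC89984: low 16 bits = 1001 1001 1000 0100
    index = 1001100110 = 614, word = 0001 = 1  ->  Set 614, word 1
0x485669AC: low 16 bits = 0110 1001 1010 1100
    index = 0110100110 = 422, word = 1011 = 11  ->  Set 422, word 11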
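
The field extraction above can be sketched with shifts and masks. This is an illustrative helper, not from the slides; the name `decode` and its default parameters are assumptions that mirror the cache in this problem, and it assumes all sizes are powers of two.

```python
def decode(addr, cache_bytes=512 * 1024, block_bytes=64, ways=8, word_bytes=4):
    """Split an address into (tag, set index, word within block)."""
    n_sets = cache_bytes // (block_bytes * ways)     # 1024 sets
    offset_bits = block_bytes.bit_length() - 1       # log2(64)   = 6
    index_bits = n_sets.bit_length() - 1             # log2(1024) = 10
    offset = addr & (block_bytes - 1)                # byte offset in block
    set_index = (addr >> offset_bits) & (n_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, offset // word_bytes

for a in (0xABC89984, 0x485669AC):
    tag, s, w = decode(a)
    print(f"{a:#010x}: tag={tag:#06x} set={s} word={w}")
```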

Problem-5: Optimization
A cache has an access time (hit latency) of 10 ns and a miss rate
of 5%. An optimization reduces the miss rate to 3% but
increases the hit latency to 15 ns. Under what condition
does this change result in better performance (lower
AMAT)?
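AMAT1 = HT1 + MR1 x MP, with HT1 = 10 ns, MR1 = 0.05
AMAT2 = HT2 + MR2 x MP, with HT2 = 15 ns, MR2 = 0.03
(HT = hit time, MR = miss rate, MP = miss penalty)
The change helps when AMAT2 < AMAT1:
15 + 0.03 x MP < 10 + 0.05 x MP
5 < 0.02 x MP  =>  MP > 250 ns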
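
The break-even miss penalty can also be computed directly by solving the inequality for MP. The sketch below is illustrative; the function names `amat` and `break_even_penalty` are assumptions, and the two sample penalties (200 ns and 300 ns) are chosen only to show one case on each side of the threshold.

```python
def amat(hit_ns, miss_rate, miss_penalty_ns):
    """Average memory access time in ns."""
    return hit_ns + miss_rate * miss_penalty_ns

def break_even_penalty(ht1, mr1, ht2, mr2):
    """Miss penalty at which both configurations have equal AMAT."""
    return (ht2 - ht1) / (mr1 - mr2)

mp_star = break_even_penalty(10, 0.05, 15, 0.03)
print(f"break-even miss penalty: {mp_star:.0f} ns")

for mp in (200, 300):
    old, new = amat(10, 0.05, mp), amat(15, 0.03, mp)
    verdict = "helps" if new < old else "does not help"
    print(f"MP={mp} ns: old={old:.1f} ns, new={new:.1f} ns -> optimization {verdict}")
```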

Problem-6: Optimization
A cache has a hit rate of 95%, a block size of 128 B and a cache hit
latency of 5 ns. Main memory takes 50 ns to return the first word
(32 bits) of a block and 10 ns for each subsequent word.
(a) What is the miss latency of the cache?
(b) If doubling the cache block size reduces the miss rate to
3%, does it reduce the AMAT?
Hr = 0.95; BS = 128 B; Ht = 5 ns; 1 word = 4 B (32 bits)

(a) # words/block = 128 B / 4 B = 32
    MP = 50 + (31 x 10) = 360 ns
    AMAT1 = 5 + 0.05 x 360 = 23 ns
(b) # words/block = 256 B / 4 B = 64
    MP = 50 + (63 x 10) = 680 ns
    AMAT2 = 5 + 0.03 x 680 = 25.4 ns
Doubling the block size does not reduce the AMAT (25.4 ns > 23 ns).
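
As a quick check of the numbers above, the miss penalty and AMAT for both block sizes can be sketched as follows. The function names and default parameters are assumptions for illustration; the memory timing (50 ns for the first word, 10 ns for each later word) is taken from the problem statement.

```python
def miss_penalty_ns(block_bytes, word_bytes=4, first_word_ns=50, next_word_ns=10):
    """Time to fetch a whole block: first word, then each remaining word."""
    words = block_bytes // word_bytes
    return first_word_ns + (words - 1) * next_word_ns

def amat_ns(hit_ns, miss_rate, penalty_ns):
    """Average memory access time in ns."""
    return hit_ns + miss_rate * penalty_ns

mp_128 = miss_penalty_ns(128)             # 32 words per block
mp_256 = miss_penalty_ns(256)             # 64 words per block
amat_128 = amat_ns(5, 0.05, mp_128)
amat_256 = amat_ns(5, 0.03, mp_256)

print(f"128 B blocks: MP={mp_128} ns, AMAT={amat_128:.1f} ns")
print(f"256 B blocks: MP={mp_256} ns, AMAT={amat_256:.1f} ns")
```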
Problem-7: NoC Routing
A packet injected from router 4 with a destination address 16 in
a 5x5 mesh interconnect system reaches router 8 through its
east input port. What are the possible output port(s) for this
packet at router 8 if it uses minimal odd-even routing algorithm?
Column parity (E = even, O = odd): 0(E) 1(O) 2(E) 3(O) 4(E)
Problem-8: NoC Routing
Consider a 25-core machine in which the cores are organized in a
regular square mesh topology. A packet P1 is generated at
core number 18 and destined to core 6. The system follows minimal
north-last routing. How many unique minimal paths are there
from 18 to 6?
Problem-10: Router – Switch Arbitration
An input-buffered NoC router R that uses an age-based switch
allocation scheme (higher age has higher priority) and XY
routing receives 4 packets in a given clock cycle. The details
(packet number, age, source, destination) of the packets are
<P1, 3, 1, 13>, <P2, 3, 7, 12>, <P3, 1, 5, 0> and <P4, 2, 4, 9>.
Given that R is router 5 in a 4x4 mesh NoC, state whether each
of the following statements is true or false.
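Pkt   Age   S    D    IP   OP-need   OP-act
P1    3     1    13   S    N         N
P2    3     7    12   E    W         W
P3    1     5    0    L    W         *
P4    2     4    9    W    N         *
(S = source, D = destination, IP = input port, OP-need = output port
requested, OP-act = output port granted, * = none granted)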
P4 enters R through its north input port. - False
Both P2 and P3 want the west output port at R. - True
At the end of the switch allocation phase, P1 and P2 are not granted a
productive output port and remain in their buffers. - False
There exists an output port conflict between P1 and P4, but P1 wins the
switch allocation stage. - True
At the end of the switch allocation phase, P3 obtains the north output port
and proceeds through the crossbar switch to the north neighbor of R. - False
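
The port assignments and the arbitration outcome can be reproduced with a short sketch. It rests on assumptions that are not stated on the slide: routers are numbered row by row (id = 4 x row + column) with row 0 at the south edge, which is the layout consistent with P1 arriving from router 1 on the south port of router 5. All function names are hypothetical.

```python
WIDTH = 4  # 4x4 mesh; assumed numbering: id = row * WIDTH + column, row 0 at the south

def xy(node):
    """Return (column, row) of a router id."""
    return node % WIDTH, node // WIDTH

def output_port(cur, dst):
    """XY routing: correct the X offset first, then Y; 'L' ejects locally."""
    (cx, cy), (dx, dy) = xy(cur), xy(dst)
    if dx != cx:
        return "E" if dx > cx else "W"
    if dy != cy:
        return "N" if dy > cy else "S"
    return "L"

def input_port(src, cur):
    """Port of `cur` on which an XY-routed packet from `src` arrives."""
    (sx, sy), (cx, cy) = xy(src), xy(cur)
    if (sx, sy) == (cx, cy):
        return "L"                        # injected by the local core
    if sy != cy:                          # Y moves come last, so last hop was along Y
        return "S" if sy < cy else "N"
    return "W" if sx < cx else "E"        # otherwise the last hop was along X

def allocate(router, packets):
    """Age-based switch allocation: the oldest requester wins each output port."""
    grants = {}
    for name, age, src, dst in sorted(packets, key=lambda p: -p[1]):
        grants.setdefault(output_port(router, dst), name)
    return grants

packets = [("P1", 3, 1, 13), ("P2", 3, 7, 12), ("P3", 1, 5, 0), ("P4", 2, 4, 9)]
for name, age, src, dst in packets:
    print(name, "in:", input_port(src, 5), "needs:", output_port(5, dst))
print("grants:", allocate(5, packets))
```

Under these assumptions the north port goes to P1 and the west port to P2, while P3 and P4 stay in their buffers, which matches the table.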
johnjose@iitg.ac.in
http://www.iitg.ac.in/johnjose/
