Lecture 07 - Performance Measurements - Single and Multiple Cycle Processor Designs
Embedded Systems
SPRING 2023
Why does some computer hardware perform better on some programs but
worse on others?
Throughput
Number of tasks the machine can run in a given period of time
$$\text{Performance}_X = \frac{1}{\text{Execution time}_X}$$
Counts everything:
$$\text{CPU Execution Time} = \text{CPU cycles} \times \text{cycle time} = \frac{\text{CPU cycles}}{\text{Clock rate}}$$
6 Embedded Systems Dr. Tarek Abdul Hamid
Improving Performance
To improve performance, we need to
Reduce number of clock cycles required by a program, or
Reduce clock cycle time (increase the clock rate)
Example:
A program runs in 10 seconds on computer X with 2 GHz clock
What is the number of CPU cycles on computer X ?
We want to design computer Y to run same program in 6 seconds
But computer Y requires 10% more cycles to execute program
What is the clock rate for computer Y ?
Solution:
CPU cycles on computer X = 10 sec × 2 × 10^9 cycles/s = 20 × 10^9 cycles
CPU cycles on computer Y = 1.1 × 20 × 10^9 = 22 × 10^9 cycles
Clock rate for computer Y = 22 × 10^9 cycles / 6 sec = 3.67 GHz
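The arithmetic in this example can be checked with a short Python sketch. The function names `cpu_cycles` and `required_clock_rate` are illustrative and do not come from the lecture.

```python
def cpu_cycles(exec_time_s: float, clock_rate_hz: float) -> float:
    """CPU cycles = execution time x clock rate."""
    return exec_time_s * clock_rate_hz


def required_clock_rate(cycles: float, target_time_s: float) -> float:
    """Clock rate = CPU cycles / execution time."""
    return cycles / target_time_s


cycles_x = cpu_cycles(10, 2e9)             # computer X: 10 s at 2 GHz
cycles_y = 1.1 * cycles_x                  # computer Y needs 10% more cycles
rate_y = required_clock_rate(cycles_y, 6)  # computer Y must finish in 6 s

print(f"Cycles on X: {cycles_x:.3g}")
print(f"Cycles on Y: {cycles_y:.3g}")
print(f"Clock rate for Y: {rate_y / 1e9:.2f} GHz")
```

Note that a 40% reduction in execution time here needs an 83% higher clock rate, because computer Y also executes more cycles.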
[Figure: timeline of 7 instructions (I1 to I7) executing over 14 clock cycles; CPI = 14/7 = 2]
Important point
Changing the cycle time often changes the number of cycles required for
various instructions (more later)
               Instruction count   CPI   Clock cycle
Program                X            X
Compiler               X            X
ISA                    X            X         X
Organization                        X         X
Technology                                    X
Using the Performance Equation
Suppose we have two implementations of the same ISA
For a given program
Machine A has a clock cycle time of 250 ps and a CPI of 2.2
Machine B has a clock cycle time of 500 ps and a CPI of 1.0
Which machine is faster for this program, and by how much?
Solution:
Both computers execute same count of instructions = I
CPU execution time (A) = I × 2.2 × 250 ps = 550 × I ps
CPU execution time (B) = I × 1.0 × 500 ps = 500 × I ps
Computer B is faster than A by a factor = (550 × I) / (500 × I) = 1.1
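The comparison above applies the performance equation, CPU time = instruction count × CPI × cycle time. A minimal sketch follows; `cpu_time_ps` is an illustrative name, and the instruction count is an arbitrary value because it cancels in the ratio.

```python
def cpu_time_ps(instruction_count: float, cpi: float, cycle_time_ps: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time (in ps)."""
    return instruction_count * cpi * cycle_time_ps


I = 1_000_000  # arbitrary instruction count; it cancels in the ratio

time_a = cpu_time_ps(I, 2.2, 250)  # machine A
time_b = cpu_time_ps(I, 1.0, 500)  # machine B

speedup_b_over_a = time_a / time_b
print(f"B is faster than A by a factor of {speedup_b_over_a:.2f}")
```

Machine A has the faster clock, yet machine B wins because its lower CPI more than offsets the longer cycle time.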
$$\text{CPU cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times C_i) \qquad \text{CPI} = \frac{\sum_{i=1}^{n} (\text{CPI}_i \times C_i)}{\sum_{i=1}^{n} C_i}$$
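The weighted sums above translate directly into code. In this sketch each instruction class is a pair (CPI_i, C_i); the sample class data is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def total_cycles(classes):
    """CPU cycles = sum over classes of CPI_i x C_i."""
    return sum(cpi * count for cpi, count in classes)


def average_cpi(classes):
    """Average CPI = total CPU cycles / total instruction count."""
    return total_cycles(classes) / sum(count for _, count in classes)


# Hypothetical program: (CPI_i, C_i) for three instruction classes.
example_classes = [(1, 500), (2, 300), (3, 200)]

print("CPU cycles:", total_cycles(example_classes))
print("Average CPI:", average_cpi(example_classes))
```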
a) Assume that peak performance is defined as the fastest rate that a computer
can execute any instruction sequence. What are the peak performances of M1
and M2 expressed in instructions per second?
b) Average CPI on M1 = (2 × 1 + 2 + 3 + 4 + 3) / 6 = 14 / 6 = 2.33
Average CPI on M2 = (2 × 2 + 2 + 2 + 4 + 4) / 6 = 16 / 6 = 2.67
Reduce power by
Reducing frequency
Reducing voltage
Energy efficiency
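To see why frequency and voltage are the levers, a common first-order model (not stated in the lecture, so treat it as an assumption) is dynamic power ≈ capacitance × voltage² × frequency. The sketch below uses hypothetical component values and compares only ratios.

```python
def dynamic_power(capacitance_f: float, voltage_v: float, frequency_hz: float) -> float:
    """First-order dynamic power model: P = C x V^2 x f (an assumed model)."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical baseline: 1 nF switched capacitance, 1.0 V, 2 GHz.
base = dynamic_power(1e-9, 1.0, 2e9)
half_frequency = dynamic_power(1e-9, 1.0, 1e9)  # halve the clock rate
lower_voltage = dynamic_power(1e-9, 0.8, 2e9)   # drop supply voltage by 20%

print(f"Half frequency: {half_frequency / base:.2f} of baseline power")
print(f"80% voltage:    {lower_voltage / base:.2f} of baseline power")
```

Under this model, power falls linearly with frequency but quadratically with voltage.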
Logical design
Putting gates (AND, NAND, …) and flip-flops together to build basic blocks
such as registers, ALUs, etc. (cf. CS 221)
Register transfer
Describes execution of instructions by showing data flow between the basic
blocks
System description
Includes memory hierarchy, I/O, multiprocessing etc
Jump (J-type): j
Format: Op (6 bits) | immediate (26 bits)
Concepts used to implement the MIPS subset are used to construct a broad
spectrum of computers
Clock cycle
Note: the same storage element can be read and written in the same cycle
Clocking Methodology
Clocks are needed in sequential logic to decide when a state element
(register) should be updated
To ensure correctness, a clocking methodology defines when data can be
written and read
We assume edge-triggered clocking
All state changes occur on the same clock edge
Data must be valid and stable before arrival of the clock edge
Edge-triggered clocking allows a register to be
[Figure labels: Register 1, Register 2]
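The clocking rules above can be illustrated with a small simulation. This is a sketch under assumptions: the `Register` class and `tick` helper are hypothetical names, and the model ignores setup and hold times. It shows that, with edge-triggered clocking, a register's current value can be read during a cycle while its new value is being prepared, and all registers update together on the clock edge.

```python
class Register:
    """Toy edge-triggered register (illustrative model, not from the lecture)."""

    def __init__(self, value=0):
        self.value = value   # output visible during the current cycle
        self._next = value   # input that will be captured at the clock edge

    def set_input(self, data):
        """Drive the data input; the output does not change until the edge."""
        self._next = data

    def clock_edge(self):
        """Capture the input: the state change happens only here."""
        self.value = self._next


def tick(registers):
    """All state changes occur on the same clock edge."""
    for reg in registers:
        reg.clock_edge()


r1, r2 = Register(5), Register(9)

# Cycle 1: each register is read while the other's input is being driven.
r1.set_input(r2.value)
r2.set_input(r1.value)
tick([r1, r2])
swapped = (r1.value, r2.value)

# Cycle 2: the same register is read and written in one cycle.
r1.set_input(r1.value + 1)
tick([r1, r2])

print("After swap:", swapped)
print("r1 after increment:", r1.value)
```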
[Figure labels: Memory, Data path, bus]
Control unit
Sends signals to data path elements
Tells what data to move, where to move it, what operations are to be
performed
Memory hierarchy
Holds program and data
[Figure: ALU with input bus 0, input bus 1, output bus, and ALU control (opcode/function)]
[Figure: register file feeding the ALU: Read Reg #0, Read Reg #1, Write Reg #, read data 0, read data 1, write data input bus, ALU control (opcode/function)]
[Figure: instruction fetch: PC, instruction address, instruction memory, 32-bit adder adding 4]
[Figure: 16-bit offset sign extension, mux, ALU, write enable, "store" data]
[Figure: PC, instruction memory, 16-bit sign extension, shift left 2, 32-bit adder, adder adding 4]
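The labels above (a 16-bit sign extension, a shift left by 2, and adders with the constant 4) match the usual MIPS next-PC and branch-target computation. A minimal sketch, assuming the standard MIPS rule target = PC + 4 + (sign-extended offset << 2); the function names are illustrative.

```python
def sign_extend_16(value: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement integer."""
    value &= 0xFFFF
    return (value ^ 0x8000) - 0x8000


def next_pc(pc: int) -> int:
    """Sequential fetch: PC + 4, wrapped to 32 bits."""
    return (pc + 4) & 0xFFFFFFFF


def branch_target(pc: int, offset16: int) -> int:
    """Branch target = (PC + 4) + (sign-extended offset shifted left by 2)."""
    return (next_pc(pc) + (sign_extend_16(offset16) << 2)) & 0xFFFFFFFF


print(hex(next_pc(0x1000)))                # sequential instruction
print(hex(branch_target(0x1000, 0x0003)))  # forward branch by 3 words
print(hex(branch_target(0x1000, 0xFFFF)))  # backward branch by 1 word
```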
Drawbacks of
Single Cycle Processor
Long cycle time
All instructions take as much time as the slowest
Instruction fetch
40% ALU, 20% Loads, 10% stores, 20% branches, & 10% jumps
Solution
Instruction   Instruction   Register   ALU         Data     Register   Total
Class         Memory        Read       Operation   Memory   Write
ALU           200           150        180         -        150        680 ps
Load          200           150        180         200      150        880 ps
Store         200           150        180         200      -          730 ps
Branch        200           150        180         -        -          530 ps
Jump          200           150 (decode and update PC)                 350 ps
(All delays in ps)
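The totals in this table can be combined with the instruction mix given before the solution (40% ALU, 20% loads, 10% stores, 20% branches, 10% jumps). The sketch below assumes the intended comparison is between a fixed clock set by the slowest class and a hypothetical variable-length clock in which each class takes only its own delay; that framing is an assumption, since the question text is not in these notes.

```python
# Total delay per instruction class, in ps (from the table above).
delays_ps = {"ALU": 680, "Load": 880, "Store": 730, "Branch": 530, "Jump": 350}

# Instruction mix as fractions of executed instructions.
mix = {"ALU": 0.40, "Load": 0.20, "Store": 0.10, "Branch": 0.20, "Jump": 0.10}

# A single-cycle CPU must use a clock as long as the slowest instruction.
fixed_clock_ps = max(delays_ps.values())

# With a variable-length clock, the average time is the mix-weighted delay.
avg_variable_ps = sum(mix[c] * delays_ps[c] for c in delays_ps)

speedup = fixed_clock_ps / avg_variable_ps

print(f"Fixed clock cycle:       {fixed_clock_ps} ps")
print(f"Average variable cycle:  {avg_variable_ps:.0f} ps")
print(f"Speedup of variable over fixed: {speedup:.2f}")
```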
Ignore the other delays in the multiplexers, control unit, sign-extension, etc.
Assume the following instruction mix: 30% ALU, 15% multiply & divide, 20% load, 10%
store, 15% branch, and 10% jump.
a) What is the total delay for each instruction class, and what is the clock cycle for the
single-cycle CPU design?
b) Assume we fix the clock cycle to 200 ps for a multi-cycle CPU. What is the CPI for
each instruction class, and what is the speedup over a fixed-length clock cycle?
Solution
a)
b)
CPI for Basic ALU = 4 cycles
CPI for Multiply & Divide = 6 cycles
CPI for Load = 5 cycles
CPI for Store = 4 cycles
CPI for Branch = 3 cycles
CPI for Jump = 2 cycles
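The per-class CPI values above can be weighted by the stated instruction mix to get the average CPI and the average time per instruction at a 200 ps clock. This is a sketch: the single-cycle clock from part (a) is not given in these notes, so the speedup is left as a function of that value, and `speedup_over_single_cycle` is an illustrative name.

```python
# CPI per instruction class on the multi-cycle CPU (from the solution above).
cpi = {"ALU": 4, "MulDiv": 6, "Load": 5, "Store": 4, "Branch": 3, "Jump": 2}

# Instruction mix from the problem statement.
mix = {"ALU": 0.30, "MulDiv": 0.15, "Load": 0.20,
       "Store": 0.10, "Branch": 0.15, "Jump": 0.10}

cycle_ps = 200  # fixed multi-cycle clock period

avg_cpi = sum(mix[c] * cpi[c] for c in cpi)
avg_time_ps = avg_cpi * cycle_ps


def speedup_over_single_cycle(single_cycle_clock_ps: float) -> float:
    """Speedup = single-cycle clock period / average multi-cycle time.

    The single-cycle clock period must come from part (a); it is a
    parameter here because its value is not stated in these notes.
    """
    return single_cycle_clock_ps / avg_time_ps


print(f"Average CPI: {avg_cpi:.2f}")
print(f"Average time per instruction: {avg_time_ps:.0f} ps")
```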