Memory Design (Module 4) – Material I – 18-Dec-2019


MEMORY DESIGN
Module 4

Memory Design
• Available memory chip size M(N,W): N × W
• Required memory size: N1 × W1, where N1 ≥ N and W1 ≥ W
• Required number of M(N,W) chips: p × q, where p = N1 / N and q = W1 / W

Memory design
There are 3 types of organizations of N1 × W1 that can be formed using N × W chips:
• N1 = N and W1 > W => increasing the word size of the chip
• N1 > N and W1 = W => increasing the number of words in the memory
• N1 > N and W1 > W => increasing both the number of words and the number of bits in each word
Memory design – Increasing the word size
• Problem 1
• Design a 128 × 16-bit RAM using 128 × 4-bit RAM chips
• Solution: p = 128 / 128 = 1; q = 16 / 4 = 4
• Therefore, p × q = 1 × 4 = 4 memory chips of size 128 × 4 are required to construct the 128 × 16-bit RAM

S.No | Memory Type | N × W | N1 × W1 | p | q | p×q | x | y | z | Total
1 | RAM | 128 × 4 | 128 × 16 | 1 | 4 | 4 | 7 | 0 | 0 | 7

x – number of address lines per chip (N = 2^x)
y – number of decoder select lines (p = 2^y)
z – number of lines to select the type of memory
Total = x + y + z
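As a quick check, the p, q, x, and y columns can be computed directly. A minimal sketch in Python (the function name and its log-based bit counts are illustrative, not from the slides):

    import math

    def design(N, W, N1, W1, z=0):
        """Chip count and address-line breakdown for building an
        N1 x W1 memory out of N x W chips."""
        p = N1 // N                  # chip rows: more words
        q = W1 // W                  # chip columns: wider word
        x = int(math.log2(N))        # address lines per chip (N = 2^x)
        y = int(math.log2(p))        # decoder select lines (p = 2^y)
        return {"p": p, "q": q, "chips": p * q,
                "x": x, "y": y, "z": z, "total": x + y + z}

    print(design(128, 4, 128, 16))
    # {'p': 1, 'q': 4, 'chips': 4, 'x': 7, 'y': 0, 'z': 0, 'total': 7}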

Memory Address Map

Component | From | To | A6–A0
RAM 1.1 | 0000 | 007F | x x x x x x x
RAM 1.2 | 0000 | 007F | x x x x x x x
RAM 1.3 | 0000 | 007F | x x x x x x x
RAM 1.4 | 0000 | 007F | x x x x x x x

Substitute 0 in place of x to get the 'From' address and 1 to get the 'To' address.
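The "substitute 0 / substitute 1" rule is easy to mechanize. A small sketch (the helper name is hypothetical) that produces the From/To pair from the fixed high-order bits and the number of don't-care (x) bits:

    def address_range(fixed_bits, n_x, width=16):
        """From/To addresses for a chip whose low n_x address bits are
        don't-cares and whose remaining high bits are fixed."""
        base = fixed_bits << n_x
        top = base | ((1 << n_x) - 1)
        return f"{base:0{width // 4}X}", f"{top:0{width // 4}X}"

    print(address_range(0b0, 7))   # ('0000', '007F'), the RAM 1.1-1.4 range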

Memory design – Increasing the word size

[Figure: all four 128 × 4 RAM chips share the address bus (A0–A6), chip select (CS), and read/write control; each chip contributes 4 data lines, and together they drive the 16-bit data bus.]


Memory Design – Increasing the number of words
• Problem 2
• Design a 1024 × 8-bit RAM using 256 × 8-bit RAM chips
• Solution: p = 1024 / 256 = 4; q = 8 / 8 = 1
• Therefore, p × q = 4 × 1 = 4 memory chips of size 256 × 8 are required to construct the 1024 × 8-bit RAM

S.No | Memory | N × W | N1 × W1 | p | q | p×q | x | y | z | Total
1 | RAM | 256 × 8 | 1024 × 8 | 4 | 1 | 4 | 8 | 2 | 0 | 10

Memory Address Map

Component | From | To | A9 A8 | A7–A0
RAM 1 | 0000 | 00FF | 0 0 | x x x x x x x x
RAM 2 | 0100 | 01FF | 0 1 | x x x x x x x x
RAM 3 | 0200 | 02FF | 1 0 | x x x x x x x x
RAM 4 | 0300 | 03FF | 1 1 | x x x x x x x x

Substitute 0 in place of x to get the 'From' address and 1 to get the 'To' address.
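With the hypothetical address_range helper sketched earlier, address_range(0b01, 8) returns ('0100', '01FF'), matching the RAM 2 row; 0b10 and 0b11 give the RAM 3 and RAM 4 rows.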
Memory Design – Increasing the number of words

[Figure: address lines A0–A7 go to all four 256 × 8 RAM chips in parallel; A8–A9 drive a 2 × 4 decoder whose outputs 0–3 feed the chip selects of RAM 1–RAM 4, so exactly one chip responds to any address. All chips share the R/W control and the 8-bit data bus.]

Memory Design
• Problem 3
• Design a 256 × 16-bit RAM using 128 × 8-bit RAM chips

S.No | Memory | N × W | N1 × W1 | p | q | p×q | x | y | z | Total
1 | RAM | 128 × 8 | 256 × 16 | 2 | 2 | 4 | 7 | 1 | 0 | 8

[Figure: address lines A6–A0 feed all four 128 × 8 chips; A7 drives a 1 × 2 decoder whose output 0 selects the pair RAM 1.1/RAM 1.2 and output 1 selects RAM 2.1/RAM 2.2. Each selected pair supplies 8 + 8 bits of the 16-bit data bus.]

Memory Design
• Problem 4
• Design a 256 × 16-bit RAM using 256 × 8-bit RAM chips and a 256 × 8-bit ROM using 128 × 8-bit ROM chips.

S.No | Memory | N × W | N1 × W1 | p | q | p×q | x | y | z | Total
1 | RAM | 256 × 8 | 256 × 16 | 1 | 2 | 2 | 8 | 0 | 1 | 9
2 | ROM | 128 × 8 | 256 × 8 | 2 | 1 | 2 | 7 | 1 | 1 | 9

Memory Address Map

Component | From | To | A8 A7 | A6–A0
RAM 1.1 | 0000 | 00FF | 0 x | x x x x x x x
RAM 1.2 | 0000 | 00FF | 0 x | x x x x x x x
ROM 1 | 0100 | 017F | 1 0 | x x x x x x x
ROM 2 | 0180 | 01FF | 1 1 | x x x x x x x
[Figure: A8 drives a 1 × 2 decoder that selects between RAM and ROM. When A8 = 0, both 256 × 8 RAM chips (1.1 and 1.2) are enabled and together supply the 16-bit word; when A8 = 1, a second 1 × 2 decoder on A7 enables ROM 1 (A7 = 0) or ROM 2 (A7 = 1), each 128 × 8.]
Memory design
Problem 5
A computer employs RAM chips of 128 × 8 and ROM chips of 512 × 8. The computer system needs 256 bytes of RAM, 1024 × 16 of ROM, and two interface units with 256 registers each. A memory-mapped I/O configuration is used. The two higher-order bits of the address bus are assigned 00 for RAM, 01 for ROM, and 10 for interface registers.
a) Compute the total number of decoders needed for the above system
b) Design a memory-address map for the above system
c) Show the chip layout for the above design

Requirements

S.No | Memory | N × W | N1 × W1 | p | q | p×q | x | y | z | Total
1 | RAM | 128 × 8 | 256 × 8 | 2 | 1 | 2 | 7 | 1 | 2 | 10
2 | ROM | 512 × 8 | 1024 × 16 | 2 | 2 | 4 | 9 | 1 | 2 | 12
3 | Interface | 256 registers | - | 2 | 1 | 2 | 8 | 1 | 2 | 11

q is always 1 for interfaces.
Number of registers = 2^x
p = number of interfaces
Number of data lines = size of the registers
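Assuming the hypothetical design helper sketched earlier, these rows are reproduced by design(128, 8, 256, 8, z=2) → total 10 and design(512, 8, 1024, 16, z=2) → total 12; the interface row follows the same x + y + z arithmetic with x = 8 (256 = 2^8 registers).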

Component | From | To | A15–A12 | A11 A10 A9 | A8–A0
RAM1 | 0000 | 007F | 0000 | 0 0 0 | 0 0 x x x x x x x
RAM2 | 0200 | 027F | 0000 | 0 0 1 | 0 0 x x x x x x x
ROM1.1 | 0400 | 05FF | 0000 | 0 1 0 | x x x x x x x x x
ROM1.2 | 0400 | 05FF | 0000 | 0 1 0 | x x x x x x x x x
ROM2.1 | 0600 | 07FF | 0000 | 0 1 1 | x x x x x x x x x
ROM2.2 | 0600 | 07FF | 0000 | 0 1 1 | x x x x x x x x x
Interface1 | 0800 | 08FF | 0000 | 1 0 0 | 0 x x x x x x x x
Interface2 | 0A00 | 0AFF | 0000 | 1 0 1 | 0 x x x x x x x x
[Figure: address bits A11–A9 drive a 3 × 8 decoder. Outputs 0 and 1 select RAM1 and RAM2 (A6–A0 to each 128 × 8 chip), outputs 2 and 3 select the two 512 × 8 ROM pairs (A8–A0), and outputs 4 and 5 select the two interface units (A7–A0 as the register address). All components share the data bus and the r/w line.]

CACHE IN EMBEDDED SYSTEM

• Before moving into cache, some discussion related to memory is necessary.
• In a processor, memory is used as long-term or medium-term storage.
• The stored information can be classified as either program or data.
• The program information consists of the sequence of instructions that causes the processor to carry out the desired system function.
• Data information represents the values being input, output, and transformed by the program.

• The program and data can be stored together or separately.
• In the Von Neumann (Princeton) architecture, the data and program words share the same memory space.
• In the Harvard architecture, the data and program words occupy separate memory spaces.
• A Harvard architecture can fetch data and an instruction simultaneously, so performance improves compared to the Von Neumann (Princeton) architecture, where it is not possible to fetch data and an instruction at the same time.

• The memory used may be ROM or RAM. ROM is more compact than RAM.
• An embedded system often uses ROM for program memory, because an embedded system's program does not change.
• Constant data is stored in ROM; any variable data needs RAM.
• Memory may be on-chip or off-chip. On-chip memory resides on the same IC as the processor; off-chip memory resides on a different IC.
• On-chip memory is accessed faster (often one cycle), but only a limited amount of on-chip memory is available.

• To reduce the time needed to access memory, a local copy of a portion of memory may be kept in a small but especially fast memory called a cache.
• Cache memory often resides on-chip and often uses fast but expensive static RAM.
• Cache memory is based on the principle that if a processor accesses a particular memory location at a particular time, then it is likely to access that location and its immediate neighbors in the near future.

CACHE MEMORY

[Figure: Processor – Cache – Memory. The cache uses a fast/expensive technology, usually on the same chip as the processor; main memory uses a slower/cheaper technology, usually on a different chip.]

Types of Locality
• Temporal locality: recently accessed items are likely to be accessed in the near future.
• Spatial locality: items whose addresses are near one another tend to be referenced close together in time.
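A small sketch of the two kinds of locality as access patterns (the array sizes are arbitrary; illustrative only):

    data = [[i * 100 + j for j in range(100)] for i in range(100)]

    # Spatial locality: touch consecutive addresses, row by row.
    total = sum(data[i][j] for i in range(100) for j in range(100))

    # Temporal locality: reuse the same few items again and again.
    hot = data[0][:8]
    for _ in range(1000):
        total += sum(hot)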
Cache read operation (flowchart):
1. Start: receive an address from the CPU.
2. Is the block containing the item in the cache?
• Yes – deliver the block to the CPU. Done.
• No – access main memory for the block containing the item, select the cache line to receive the block, load the main-memory block into the cache, and deliver the block to the CPU. Done.
Cache Memory Management Techniques
• Block management (placement): Direct Mapping, Set Associative, Fully Associative
• Block identification: Tag, Block Index, Offset
• Block replacement: FCFS, LRU, Random
• Update policies: Write Through, Write Back, Write Around, Write Allocate

CACHE MAPPING TECHNIQUES

• Cache mapping is the method of assigning main memory addresses to cache memory addresses and of determining whether a particular main memory address's contents are in the cache.
• Three techniques are used:
1) Direct Mapping
2) Fully Associative Mapping
3) Set Associative Mapping

Example – Direct Mapping

[Figure: a main memory of 16 blocks (0–15) mapped onto a cache of 8 lines (0–7).]

Cache line = (MM block address) mod (number of lines in the cache)
e.g., (12) mod (8) = 4, so main-memory block 12 maps to cache line 4.
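A minimal direct-mapped lookup in Python, reproducing the slide's 12 mod 8 = 4 example (the 8-line cache comes from the figure; the tag computation is the standard one, not spelled out on this slide):

    LINES = 8    # cache lines, from the figure

    def map_block(block_addr):
        """Direct mapping: each main-memory block has exactly one home line."""
        line = block_addr % LINES     # index into the cache
        tag = block_addr // LINES     # distinguishes the blocks sharing a line
        return line, tag

    print(map_block(12))   # (4, 1): block 12 goes to line 4, as on the slide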

DIRECT MAPPING

TAG & VALID BIT

HOW BIG IS THE CACHE?

MEMORY SYSTEM PERFORMANCE

AVERAGE MEMORY ACCESS TIME

EXAMPLE

SPATIAL LOCALITY

BLOCK ADDRESS

CACHE MAPPING

DATA PLACEMENT WITHIN A BLOCK

LOCATING DATA WITHIN CACHE

USING ARITHMETIC

LARGER CACHE

EXAMPLE

DISADVANTAGES OF DIRECT MAPPING

FULLY ASSOCIATIVE MAPPING

SET ASSOCIATIVE MAPPING

LOCATING A SET-ASSOCIATIVE BLOCK

EXAMPLE

BLOCK REPLACEMENT

Cache-replacement policy
• Technique for choosing which block to replace
  • when a fully associative cache is full
  • when a set-associative cache's set is full
  • a direct-mapped cache has no choice
• Random: replace a block chosen at random
• LRU (least recently used): replace the block not accessed for the longest time
• FIFO (first-in first-out): push a block onto a queue when it is placed in the cache; choose the block to replace by popping the queue
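A sketch of LRU bookkeeping for one cache set (an OrderedDict stands in for the hardware's age bits; the 2-way set size is an arbitrary choice):

    from collections import OrderedDict

    class LRUSet:
        def __init__(self, ways):
            self.ways, self.blocks = ways, OrderedDict()

        def access(self, tag):
            hit = tag in self.blocks
            if hit:
                self.blocks.move_to_end(tag)         # mark most recently used
            else:
                if len(self.blocks) == self.ways:
                    self.blocks.popitem(last=False)  # evict least recently used
                self.blocks[tag] = True
            return hit

    s = LRUSet(2)
    print([s.access(t) for t in [1, 2, 1, 3, 2]])
    # [False, False, True, False, False] -- block 3 evicts 2, the LRU block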

Cache write techniques

• When written, a data cache must eventually update main memory
• Write-through
  • write to main memory whenever the cache is written to
  • easiest to implement
  • processor must wait for the slower main-memory write
  • potential for unnecessary writes
• Write-back
  • main memory is written only when a "dirty" block is replaced
  • an extra dirty bit for each block, set when the cache block is written, reduces the number of slow main-memory writes
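A write-back sketch: writes go to the cache and set the dirty bit, and main memory sees a write only when a dirty block is evicted (the single cache line and dict-based memory are simplifications for illustration):

    memory = {}
    line = {"tag": None, "data": None, "dirty": False}   # one cache line

    def cpu_write(tag, value):
        if line["dirty"] and line["tag"] not in (None, tag):
            memory[line["tag"]] = line["data"]   # the only slow memory write
        line.update(tag=tag, data=value, dirty=True)

    cpu_write(5, "a"); cpu_write(5, "b"); cpu_write(9, "c")
    print(memory)   # {5: 'b'} -- two writes to block 5 cost one memory write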

Cache impact on system performance

• Most important parameters in terms of performance:
  • Total size of cache: total number of data bytes the cache can hold (tag, valid, and other housekeeping bits not included in the total)
  • Degree of associativity
  • Data block size
• Larger caches achieve lower miss rates but higher access cost, e.g.:
  • 2-Kbyte cache: miss rate = 15%, hit cost = 2 cycles, miss cost = 20 cycles
    • avg. cost of memory access = (0.85 × 2) + (0.15 × 20) = 4.7 cycles
  • 4-Kbyte cache: miss rate = 6.5%, hit cost = 3 cycles, miss cost unchanged
    • avg. cost of memory access = (0.935 × 3) + (0.065 × 20) = 4.105 cycles (improvement)
  • 8-Kbyte cache: miss rate = 5.565%, hit cost = 4 cycles, miss cost unchanged
    • avg. cost of memory access = (0.94435 × 4) + (0.05565 × 20) = 4.8904 cycles (worse)
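The slide's three averages come from one formula, avg. cost = (1 − miss rate) × hit cost + miss rate × miss cost; a quick check in Python:

    def avg_access_cost(miss_rate, hit_cost, miss_cost=20):
        return (1 - miss_rate) * hit_cost + miss_rate * miss_cost

    for kb, mr, hc in [(2, 0.15, 2), (4, 0.065, 3), (8, 0.05565, 4)]:
        print(f"{kb} Kbyte cache: {avg_access_cost(mr, hc):.4f} cycles")
    # 2 Kbyte: 4.7000, 4 Kbyte: 4.1050, 8 Kbyte: 4.8904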

Cache performance trade-offs

• Improving cache hit rate without increasing size
  • Increase line size
  • Change set-associativity

[Chart: % cache miss (y-axis, 0–0.16) vs. cache size (x-axis, 1 Kb–128 Kb) for 1-way, 2-way, 4-way, and 8-way set-associative caches.]
Comparison of Cache Mapping Techniques
• There is a critical trade-off in cache performance that has led to the creation of the various cache mapping techniques described in the previous section. For the cache to have good performance, you want to maximize both of the following:
• Hit ratio: you want to increase as much as possible the likelihood of the cache containing the memory addresses that the processor wants. Otherwise, you lose much of the benefit of caching because there will be too many misses.
• Search speed: you want to determine as quickly as possible whether you have scored a hit in the cache. Otherwise, you lose a small amount of time on every access, hit or miss, while you search the cache.
Comparison of Cache Mapping Techniques (contd.)

Cache Type | Hit Ratio | Search Speed
Direct Mapped | Good | Best
Fully Associative | Best | Moderate
N-Way Set Associative, N > 1 | Very good, better as N increases | Good, worse as N increases

What are the advantages and disadvantages of cache mapping techniques?
Advantages:
1) Faster memory access
2) Higher CPU utilization
Disadvantages:
1) Cost factor
2) Cache coherency

BASIC ARCHITECTURE
• Control unit and datapath
• Note similarity to a single-purpose processor
• Key differences:
  • The datapath is general
  • The control unit doesn't store the algorithm – the algorithm is "programmed" into the memory

[Figure: Processor = control unit (controller, PC, IR) + datapath (ALU, registers, control/status), connected to memory and I/O.]

Datapath Operations
• Load: read a memory location into a register
• ALU operation: input certain registers through the ALU, store the result back in a register
• Store: write a register to a memory location

[Figure: the register transfers on the datapath for load, ALU operation (+1), and store.]

Control Unit
• Control unit: configures the datapath operations
• Sequence of desired operations ("instructions") stored in memory – the "program"
• Instruction cycle – broken into several sub-operations, each one clock cycle, e.g.:
  • Fetch: get the next instruction into the IR
  • Decode: determine what the instruction means
  • Fetch operands: move data from memory to a datapath register
  • Execute: move data through the ALU
  • Store results: write data from a register to memory

[Figure: processor with PC, IR, R0, R1; memory holds the program 100: load R0, M[500]; 101: inc R1, R0; 102: store M[501], R1; data M[500] = 10.]

Control Unit Sub-Operations
• Fetch: get the next instruction into the IR
  • PC: program counter, always points to the next instruction
  • IR: holds the fetched instruction

[Figure: PC = 100; IR now holds "load R0, M[500]".]

Control Unit Sub-Operations
• Decode: determine what the instruction means

[Figure: same processor state; the controller interprets "load R0, M[500]".]

Control Unit Sub-Operations
• Fetch operands: move data from memory to a datapath register

[Figure: R0 receives the value 10 from memory location 500.]

Control Unit Sub-Operations
• Execute: move data through the ALU
  • This particular instruction does nothing during this sub-operation

Control Unit Sub-Operations
• Store results: write data from a register to memory
  • This particular instruction does nothing during this sub-operation

Instruction Cycles
• PC = 100: fetch, decode, fetch operands, execute, store results – one clock cycle per sub-operation.

[Figure: the five sub-operations of load R0, M[500] laid out against the clock.]

Instruction Cycles
• PC = 101: the same five sub-operations run for inc R1, R0; the ALU increments R0 = 10 and leaves 11 in R1.

[Figure: the instruction cycles for PC = 100 and PC = 101 back-to-back on the clock.]

Instruction Cycles
• PC = 102: store M[501], R1 writes the value 11 from R1 into memory location 501.

[Figure: the three instruction cycles for PC = 100, 101, and 102 on the clock.]
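The whole fetch–decode–execute story can be played out in a few lines; a toy interpreter for the slides' three-instruction program (the textual instruction encoding is an assumption made for this sketch):

    memory = {100: "load R0, M[500]", 101: "inc R1, R0",
              102: "store M[501], R1", 500: 10}
    regs, pc = {"R0": 0, "R1": 0}, 100

    while pc in (100, 101, 102):
        ir = memory[pc]                          # fetch into IR
        op, args = ir.split(" ", 1)              # decode
        if op == "load":                         # fetch operand from memory
            r, m = args.split(", ")
            regs[r] = memory[int(m[2:-1])]
        elif op == "inc":                        # execute: data through ALU
            dst, src = args.split(", ")
            regs[dst] = regs[src] + 1
        elif op == "store":                      # store result to memory
            m, r = args.split(", ")
            memory[int(m[2:-1])] = regs[r]
        pc += 1                                  # PC -> next instruction

    print(regs, memory[501])    # {'R0': 10, 'R1': 11} 11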

Architectural Considerations
• N-bit processor: N-bit ALU, registers, buses, memory data interface
  • Embedded: 8-bit, 16-bit, 32-bit common
  • Desktops/servers: 32-bit, even 64-bit
• PC size determines the address space

Architectural Considerations
• Clock frequency: the inverse of the clock period
• The clock period must be longer than the longest register-to-register delay in the entire processor
• Memory access is often the longest such delay

8051 Basic Architecture

CPU Architecture of PIC

• Speed: Harvard architecture, RISC architecture, 1 instruction cycle = 4 clock cycles.
• Instruction set simplicity: the instruction set consists of just 35 instructions (as opposed to 111 instructions for the 8051).
• Power-on reset and brown-out reset.
• A watchdog timer (user programmable).
• The PIC microcontroller has four optional clock sources: low-power crystal, mid-range crystal, high-range crystal, and RC oscillator (low cost).
• Programmable timers and on-chip ADC.
• Up to 12 independent interrupt sources.
• Powerful output pin control (25 mA (max.) current-sourcing capability per pin).
• EPROM/OTP/ROM/Flash memory options.
• I/O port expansion capability.

DSP Architecture

ARM Architecture

CISC Features
• Complex instruction set computer
• Large number of instructions (~200–300 instructions)
• Specialized complex instructions
• Many different addressing modes
• Variable-length instruction format
• Examples: 68000, 80x86

RISC Features
• Reduced instruction set computer
• Relatively few instructions (~50)
• Basic instructions
• Relatively few addressing modes
• Fixed-length instruction format
• Only load/store instructions can access memory
• Large number of registers
• Examples: MIPS, Alpha, ARM, etc.

CISC vs RISC
• CISC – high code density
  • Fewer instructions needed to specify the algorithm
• RISC – simpler to design
  • Higher performance
  • Lower power consumption
  • Easier to develop compilers to take advantage of all features


CISC examples
• Intel: 80x86
• Motorola: 680x0

RISC examples
• Sun: SPARC
• Silicon Graphics: MIPS
• HP: PA-RISC
• IBM: PowerPC
• Compaq: Alpha

DMA
• Direct Memory Access: a bus operation not controlled by the CPU
• Controlled by a DMA controller (a bus master)
• 2 additional wires: bus request & bus grant

[Figure: CPU, DMAC, device, and memory share the bus; the DMAC raises bus request and the CPU replies with bus grant.]

PIPELINE

Pipelining

Laundry example
• A, B, C, D each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes

What Is Pipelining

[Chart: sequential laundry – tasks A–D each occupy 30 + 40 + 20 minutes back-to-back, from 6 PM to midnight.]

• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

What Is Pipelining
• Start work ASAP

[Chart: pipelined laundry – load B's wash starts while load A dries; the stages overlap at 30, 40, 40, 40, 40, and 20 minutes.]

• Pipelined laundry takes 3.5 hours for 4 loads

Pipelining
• A technique of decomposing a sequential process into sub-operations, with each sub-process being executed in a special dedicated segment that operates concurrently with all other segments.
• Each segment consists of a register and combinational logic. The register is used as a junction between two segments.

E.g.: Ai * Bi + Ci for i = 1, 2, 3, ..., 7

• R1 ← Ai, R2 ← Bi        (Load Ai and Bi)
• R3 ← R1 * R2, R4 ← Ci   (Multiply and load Ci)
• R5 ← R3 + R4            (Add)
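A sketch of the three-segment pipeline in Python (Ci is carried alongside the operands here for simplicity; the operand values are arbitrary):

    A = [1, 2, 3, 4, 5, 6, 7]
    B = [7, 6, 5, 4, 3, 2, 1]
    C = [9, 9, 9, 9, 9, 9, 9]

    seg1 = seg2 = None            # the inter-segment registers
    results = []
    for i in range(len(A) + 2):   # k + n - 1 = 3 + 7 - 1 = 9 clock cycles
        if seg2 is not None:
            results.append(seg2[0] + seg2[1])   # segment 3: R5 <- R3 + R4
        # segment 2: R3 <- R1 * R2, R4 <- Ci
        seg2 = (seg1[0] * seg1[1], seg1[2]) if seg1 else None
        # segment 1: R1 <- Ai, R2 <- Bi
        seg1 = (A[i], B[i], C[i]) if i < len(A) else None

    print(results)   # one Ai*Bi + Ci result per clock once the pipeline fills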

Operations in each pipeline stage

General Considerations & Space-Time Diagram

• Consider the case:
  • number of segments = k
  • number of tasks = n
  • clock cycle time = tp
• The first task takes k * tp clock-cycle time to complete its operation.
• The remaining (n - 1) tasks emerge from the pipeline at a rate of one task per clock cycle, and they will be completed after a time equal to (n - 1) * tp.
• To complete n tasks using a k-segment pipeline requires k + (n - 1) clock cycles.

• Consider the space-time diagram where k = 4 and n = 6.
• The time required to complete all tasks is 4 + (6 - 1) = 9 clock cycles.

PIPELINE SPEEDUP
Consider a non-pipelined unit that performs the same operation and takes a time equal to tn to complete each task.

n: number of tasks to be performed

• Conventional machine (non-pipelined)
  tn: clock cycle
  t1: time required to complete the n tasks
  t1 = n * tn

• Pipelined machine (k stages)
  tp: clock cycle (time to complete each sub-operation)
  tk: time required to complete the n tasks
  tk = (k + n - 1) * tp

• Speedup
  Sk = n * tn / ((k + n - 1) * tp)
  lim (n → ∞) Sk = tn / tp   ( = k, if tn = k * tp )
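Both formulas in a few lines of Python (a quick check; the large-n numbers are arbitrary):

    def pipeline_cycles(k, n):
        return k + (n - 1)                    # cycles for n tasks, k segments

    def speedup(k, n, tn, tp):
        return (n * tn) / (pipeline_cycles(k, n) * tp)   # Sk

    print(pipeline_cycles(4, 6))              # 9, as in the space-time example
    print(speedup(4, 10**6, tn=4, tp=1))      # ~4.0: Sk approaches k = tn/tp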
• There are two areas in computer design where the pipeline organization is applicable:
i. Arithmetic pipeline: divides an arithmetic operation into sub-operations for execution in the pipeline segments.
ii. Instruction pipeline: operates on a stream of instructions by overlapping the fetch, decode, and execute phases of the instruction cycle.

ARITHMETIC PIPELINE
• The arithmetic pipeline is used to implement floating-point operations and the multiplication of fixed-point numbers.
• Let's take the example of a floating-point operation that is decomposed into sub-operations.

[Figure: the four-segment floating-point adder/subtractor pipeline.]

FLOWCHART OUTLINE WITH EXAMPLE

• X = 0.9504 × 10³
  Y = 0.8200 × 10²
• The exponents are compared by subtracting them to determine their difference.
• The two exponents are subtracted in the first segment to obtain 3 − 2 = 1.
• The larger exponent is chosen as the exponent of the result.
• The exponent difference determines how many times the mantissa associated with the smaller exponent has to be shifted.

• The larger exponent, 3, is chosen as the exponent of the result.
• The mantissa of Y is shifted once (3 − 2 = 1) to the right to get Y = 0.0820 × 10³.
• The two mantissas are added or subtracted in segment 3. The result is normalized in segment 4.
• The segment-3 result is Z = 1.0324 × 10³.

• When an overflow occurs, the mantissa of the sum or difference is shifted right and the exponent is incremented by one.
• The result of the sum is adjusted by normalizing it so that the fraction has a nonzero first digit.
• Here the mantissa is shifted once to the right and the exponent incremented by one: Z = 0.10324 × 10⁴.
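The four segments, traced in Python on the slide's numbers (decimal mantissa/exponent pairs; the underflow loop anticipates the case discussed next):

    def fp_add(x, y):
        (mx, ex), (my, ey) = x, y
        diff, e = ex - ey, max(ex, ey)   # segment 1: compare exponents
        if diff >= 0:                    # segment 2: align smaller mantissa
            my /= 10 ** diff
        else:
            mx /= 10 ** -diff
        m = mx + my                      # segment 3: add the mantissas
        while abs(m) >= 1.0:             # segment 4: overflow -> shift right
            m, e = m / 10, e + 1
        while m != 0 and abs(m) < 0.1:   # segment 4: underflow -> shift left
            m, e = m * 10, e - 1
        return round(m, 5), e

    print(fp_add((0.9504, 3), (0.8200, 2)))   # (0.10324, 4)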

• If underflow occurs, the number of leading zeros in the mantissa determines the number of left shifts of the mantissa, and that number must be subtracted from the exponent.
• The comparator, shifter, adder-subtractor, incrementer, and decrementer in the floating-point pipeline are implemented with combinational circuits.

• Suppose the time delays of the 4 segments are t1 = 60 ns, t2 = 70 ns, t3 = 100 ns, t4 = 80 ns, and the interface registers have tr = 10 ns.
• The clock cycle is chosen as tp = t3 + tr = 100 + 10 = 110 ns.
• An equivalent non-pipelined floating-point adder-subtractor will have a delay time of tn = t1 + t2 + t3 + t4 + tr = 320 ns.
• The speedup is 320/110 = 2.9 over the non-pipelined adder.
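Checking the arithmetic:

    t = [60, 70, 100, 80]              # segment delays (ns)
    tr = 10                            # interface-register delay (ns)

    tp = max(t) + tr                   # pipeline clock: slowest segment + tr
    tn = sum(t) + tr                   # equivalent non-pipelined delay
    print(tp, tn, round(tn / tp, 1))   # 110 320 2.9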

INSTRUCTION PIPELINE
• Pipelining occurs not only in the data stream but in the instruction stream as well.
• An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.
• A problem occurs with branch instructions: in that case the pipeline must be emptied, and the instructions fetched after the branch must be discarded.

1. Fetch the instruction from memory
2. Decode the instruction
3. Calculate the effective address
4. Fetch the operand from memory
5. Execute the instruction
6. Store the result in the proper place

• Certain difficulties can prevent the pipeline from operating at its maximum rate:
• Different segments may take different times to operate on the incoming information.
• Two or more segments may require memory access at the same time, causing one segment to wait until the other is finished with the memory.

• This can be resolved by using two memory buses, accessing instructions and data in separate memories.
• The pipeline design will be efficient if the instruction cycle is divided into segments of equal duration.

4-STAGE PIPELINE
1. FI: fetch an instruction from memory
2. DA: decode the instruction and calculate the effective address of the operand
3. FO: fetch the operand
4. EX: execute the operation and store the result
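A sketch that prints the space-time occupancy of the four segments, matching the timing diagram referenced below (the instruction count is a parameter; no branches or memory conflicts are modeled):

    STAGES = ["FI", "DA", "FO", "EX"]

    def timing(n):
        """Which instruction occupies each segment on every clock cycle."""
        k = len(STAGES)
        for clock in range(1, k + n):            # k + n - 1 clock cycles
            busy = [(STAGES[clock - i], i) for i in range(1, n + 1)
                    if 0 <= clock - i < k]
            busy.sort(key=lambda si: STAGES.index(si[0]))
            print(f"cycle {clock}:", ", ".join(f"I{i}:{s}" for s, i in busy))

    timing(4)   # cycle 4 shows I4:FI, I3:DA, I2:FO, I1:EX -- full overlap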

INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE

TIMING OF THE INSTRUCTION PIPELINE

• It is assumed that the processor has separate instruction and data memories.
• Thus in step 4, instruction 1 is being executed, the operand for instruction 2 is being fetched, instruction 3 is being decoded, and instruction 4 is being fetched from memory.

• Assume now that instruction 3 is a branch instruction.
• As soon as the branch instruction is decoded in step 4, the decoding of the following instructions is stopped until the branch instruction is executed.
• Only once it is executed is the next instruction's address known, and the processor fetches from that address.

• Another delay may occur in the pipeline if the Execute segment needs to store a value in data memory while the Fetch Operand segment is fetching an operand from data memory.
• In that case the Fetch Operand segment must wait until the Execute segment finishes storing the value.

PIPELINE HAZARDS
• Structural hazard
• Data hazard
• Control hazard

STRUCTURAL HAZARD

• The hardware resources required by the instructions in simultaneous overlapped execution cannot be met.
• When two segments access the memory at the same time, a structural hazard results.
• A structural hazard is also called a resource conflict.

• Structural hazards occur when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.
• Example: with one memory port, a data fetch and an instruction fetch cannot be initiated in the same clock cycle.

DATA HAZARD
• An instruction scheduled to be executed in the pipeline requires the result of a previous instruction, which has not yet arrived.
• E.g.:  ADD R1, R2, R3
         SUB R4, R1, R5
• A data hazard can be dealt with by either hardware techniques or software techniques.

HARDWARE TECHNIQUES
• Interlock: hardware detects the data dependency and delays the scheduling of the dependent instruction by stalling for enough clock cycles.
• Forwarding (bypassing, short-circuiting): accomplished by a datapath that routes a value from a source (usually an ALU) to a user, bypassing a designated register. This allows the value to be used at an earlier stage in the pipeline than would otherwise be possible.
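A toy model of the interlock-versus-forwarding trade-off for the ADD/SUB pair above (the stall-count formula is a simplification invented for this sketch, assuming the 4-stage FI-DA-FO-EX pipeline):

    STAGES = {"FI": 0, "DA": 1, "FO": 2, "EX": 3}

    def stalls(write_stage, read_stage, distance, forwarding):
        """distance = 1 means the two instructions are adjacent."""
        produce = write_stage - (1 if forwarding else 0)
        return max(0, produce - read_stage - distance + 1)

    # ADD writes R1 in EX; the adjacent SUB reads R1 in FO.
    print(stalls(STAGES["EX"], STAGES["FO"], 1, forwarding=False))  # 1 stall
    print(stalls(STAGES["EX"], STAGES["FO"], 1, forwarding=True))   # 0 stalls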

FORWARDING HARDWARE

INSTRUCTION SCHEDULING

CONTROL HAZARDS
