Memory Design (Module 4A): Material I, 18-Dec-2019
MEMORY DESIGN
Module 4
Memory Design
• Available memory chip size: N × W (N words, W bits per word)
• Required memory size: N1 × W1
• p = N1 / N, q = W1 / W; p × q chips are required
Memory design
There are three types of organizations of N1 × W1 that can
be formed using N × W chips:
• N1 = N and W1 > W => increasing the word size of the
chip
• N1 > N and W1 = W => increasing the number of
words in the memory
• N1 > N and W1 > W => increasing both the number of
words and number of bits in each word.
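The chip-count arithmetic above can be sketched as a small helper (an illustrative function, not part of the slides; the names are assumptions):

```python
def chips_required(N, W, N1, W1):
    """Chips of size N x W needed to build an N1 x W1 memory."""
    assert N1 % N == 0 and W1 % W == 0, "sizes must divide evenly"
    p = N1 // N   # rows of chips: multiplies the number of words
    q = W1 // W   # chips per row: widens each word
    return p, q, p * q

print(chips_required(128, 4, 128, 16))  # word-size increase: (1, 4, 4)
print(chips_required(256, 8, 1024, 8))  # word-count increase: (4, 1, 4)
```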
Memory design – Increasing the word size
• Problem - 1
• Design 128 × 16 - bit RAM using 128 × 4 - bit RAM
• Solution: p = 128 / 128 = 1; q = 16 / 4 = 4
• Therefore, p × q = 1 × 4 = 4 memory chips of size 128 × 4 are required to
construct 128 × 16 bit RAM
[Address map: RAM 1.1 spans 0000 to 007F; address bits A6-A0 select the word within every chip.]
[Figure: four 128 × 4 chips share the address bus and a common read/write control; each chip contributes 4 data bits (Data 0-3), concatenated onto the 16-bit data bus.]
S.No  Memory  N × W    N1 × W1   p  q  p×q  x  y  z  Total
1     RAM     256 × 8  1024 × 8  4  1  4    8  2  0  10

(x = address bits per chip, y = decoder bits selecting the row of chips, z = additional select bits; Total = address bits required.)
Address map: RAM 1 spans 0000 to 00FF.
[Figure: four 256 × 8 RAM chips share an 8-bit address bus (A7-A0) and an 8-bit data bus; a 2 × 4 decoder driven by A9 and A8 drives the chip select (CS) of each RAM; the R/W line is common to all chips.]
[Figure: logical diagram of the same 1024 × 8 design: address bits 9-0; a 2 × 4 decoder selects RAM 1 to RAM 4 (256 × 8 each); 8-bit data lines.]
Memory Design
• Problem - 3
• Design 256 × 16 – bit RAM using 128 × 8 – bit RAM chips
S.No  Memory  N × W    N1 × W1   p  q  p×q  x  y  z  Total
1     RAM     128 × 8  256 × 16  2  2  4    7  1  0  8
[Figure: four 128 × 8 chips arranged in two rows of two. RAM 1.1/RAM 1.2 form the first 128 words, RAM 2.1/RAM 2.2 the second; a 1 × 2 decoder (outputs 1 and 0) selects the row via chip select; each row's two chips each supply 8 bits onto the 16-bit data bus.]
Memory Design
• Problem - 4
• Design 256 × 16 – bit RAM using 256 × 8 – bit RAM chips and
256 × 8 – bit ROM using 128 × 8 – bit ROM chips.
S.No  Memory  N × W    N1 × W1   p  q  p×q  x  y  z  Total
1     RAM     256 × 8  256 × 16  1  2  2    8  0  1  9
2     ROM     128 × 8  256 × 8   2  1  2    7  1  1  9

(z = 1 select bit distinguishes the RAM region from the ROM region.)
[Figure: chip layout. A select bit feeds two 1 × 2 decoders; 256 × 8 RAM 1.1 and RAM 1.2 together supply the 16-bit RAM word, while the 128 × 8 ROM chips are row-selected to form the 256 × 8 ROM; data paths are 8 bits per chip.]
Memory design
Problem – 5
A computer employs RAM chips of 128 x 8 and ROM chips of
512 x 8. The computer system needs 256 bytes of RAM, 1024 x
16 of ROM, and two interface units with 256 registers each. A
memory-mapped I/O configuration is used. The two higher-order
bits of the address bus are assigned 00 for RAM, 01 for ROM,
and 10 for interface registers.
a) Compute the total number of decoders needed for the above
system
b) Design a memory-address map for the above system
c) Show the chip layout for the above design
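The region decode in this problem can be sketched as follows (a hypothetical helper, not from the slides; it assumes a 16-bit address bus, consistent with the 15-0 address-map columns used elsewhere in this deck):

```python
def region(addr):
    """Map a 16-bit address to its region by its two high-order bits:
    00 -> RAM, 01 -> ROM, 10 -> interface registers."""
    top2 = (addr >> 14) & 0b11
    return {0b00: "RAM", 0b01: "ROM", 0b10: "interface"}.get(top2, "unused")

print(region(0x0000))  # RAM
print(region(0x4000))  # ROM
print(region(0x8000))  # interface
```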
Requirements
[Table: worksheet with columns S.No, Memory, N × W, N1 × W1, p, q, p×q, x, y, z, Total.]
[Memory-address map (partial): RAM 1 occupies 0000 to 007F; high-order address bits are 0, and bits A6-A0 select the word.]
CACHE IN EMBEDDED SYSTEM
CACHE MEMORY
Processor <-> Cache <-> Main Memory
• Cache: fast/expensive technology, usually on the same chip as the processor
• Main memory: slower/cheaper technology, usually on a different chip
Types of Locality
• Temporal locality: recently accessed items are likely to be accessed again in the near future.
• Spatial locality: items whose addresses are near one another tend to be referenced close together in time.
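Spatial locality can be demonstrated by traversing a 2-D array in row-major versus column-major order (an illustrative experiment, not from the slides; the effect is far more dramatic in compiled languages than in Python, and exact timings are machine-dependent):

```python
import time

N = 1000
a = [[0] * N for _ in range(N)]

def traverse(row_major):
    s = 0
    for i in range(N):
        for j in range(N):
            # row-major touches adjacent elements; column-major jumps
            # a whole row between consecutive accesses
            s += a[i][j] if row_major else a[j][i]
    return s

t0 = time.perf_counter(); traverse(True);  t_row = time.perf_counter() - t0
t0 = time.perf_counter(); traverse(False); t_col = time.perf_counter() - t0
print(f"row-major {t_row:.3f}s, column-major {t_col:.3f}s")
```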
[Figure: cache read flowchart. Start; on a hit, deliver the block to the CPU; on a miss, select the cache line to receive the block from main memory, fetch it, then deliver the block to the CPU; done.]
Cache Memory Management Techniques
• Block Management (placement): Direct Mapping, Set Associative, Fully Associative
• Block Identification: Tag, Block Index, Offset
• Block Replacement: FCFS, LRU, Random
• Update Policies: Write Through, Write Back, Write Around, Write Allocate
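The block-identification fields can be sketched as follows (the cache geometry here, 8 lines of 4 bytes, is an assumption chosen for illustration):

```python
OFFSET_BITS = 2  # 4-byte blocks
INDEX_BITS = 3   # 8 cache lines

def split_address(addr):
    """Split an address into (tag, index, offset) for a direct-mapped cache."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0b1101_101_10))  # (13, 5, 2): tag 13, line 5, byte 2
```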
Example
Direct Mapping
[Figure: main-memory blocks 0-15 mapping onto cache lines 0-7.]
Cache line = (MM block address) mod (number of lines in the cache)
Example: 12 mod 8 = 4, so main-memory block 12 maps to cache line 4.
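The mapping rule above is a single modulo operation; a one-line sketch of the slide's example:

```python
def cache_line(block_address, num_lines=8):
    """Direct mapping: a main-memory block maps to line (block mod lines)."""
    return block_address % num_lines

print(cache_line(12))  # 4, matching the example above
```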
DIRECT MAPPING
EXAMPLE
SPATIAL LOCALITY
BLOCK ADDRESS
CACHE MAPPING
USING ARITHMETIC
LARGER CACHE
EXAMPLE
EXAMPLE
BLOCK REPLACEMENT
Cache-replacement policy
• Technique for choosing which block to replace
• when fully associative cache is full
• when set-associative cache’s line is full
• Direct mapped cache has no choice
• Random
• replace block chosen at random
• LRU: least-recently used
• replace block not accessed for longest time
• FIFO: first-in first-out
• push a block onto the queue when it is brought into the cache
• choose the block to replace by popping the queue
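The LRU and FIFO policies can be sketched on a small fully associative cache (an illustrative simulation, not from the slides; the reference string and cache size are arbitrary):

```python
from collections import OrderedDict, deque

def lru_misses(refs, size=3):
    cache, misses = OrderedDict(), 0
    for b in refs:
        if b in cache:
            cache.move_to_end(b)            # refresh as most recently used
        else:
            misses += 1
            if len(cache) == size:
                cache.popitem(last=False)   # evict least recently used
            cache[b] = True
    return misses

def fifo_misses(refs, size=3):
    cache, order, misses = set(), deque(), 0
    for b in refs:
        if b not in cache:
            misses += 1
            if len(cache) == size:
                cache.discard(order.popleft())  # evict oldest arrival
            cache.add(b)
            order.append(b)
    return misses

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print("LRU misses:", lru_misses(refs))    # 10
print("FIFO misses:", fifo_misses(refs))  # 9
```

Note that neither policy dominates the other: on this particular reference string FIFO happens to miss less often than LRU.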
[Figure: cache miss rate (0 to 0.16) versus cache size (1 KB to 128 KB) for 1-, 2-, 4-, and 8-way set-associative caches; the miss rate falls as cache size and associativity increase.]
Comparison of Cache Mapping Techniques
• There is a critical trade-off in cache performance that has led to
the creation of the various cache mapping techniques described
in the previous section. In order for the cache to have good
performance you want to maximize both of the following:
• Hit Ratio: You want to increase as much as possible the
likelihood of the cache containing the memory addresses that
the processor wants. Otherwise, you lose much of the benefit of
caching because there will be too many misses.
• Search Speed: You want to be able to determine as quickly as
possible if you have scored a hit in the cache. Otherwise, you
lose a small amount of time on every access, hit or miss, while
you search the cache.
Comparison of Cache Mapping Techniques (contd.)

Cache Type                   Hit Ratio                         Search Speed
Direct Mapped                Good                              Best
Fully Associative            Best                              Moderate
N-Way Set Associative (N>1)  Very good, better as N increases  Good, worse as N increases
BASIC ARCHITECTURE
• Processor = control unit + datapath
• Datapath operations:
• Load: read a memory location into a register
• ALU operation: input certain registers through the ALU, store the result back in a register
[Figure: processor (control unit; datapath with ALU, controller, control/status, registers) connected through I/O to memory holding the values 10 and 11.]
Control Unit
• Control unit: configures the datapath operations
• Sequence of desired operations ("instructions") stored in memory: the "program"
• Instruction cycle: broken into several sub-operations, each taking one clock cycle, e.g.:
• Fetch: get the next instruction into the IR
• Decode: determine what the instruction means
• Fetch operands: move data from memory to a datapath register
• Execute: move data through the ALU
• Store results: write data from a register to memory
[Figure: processor with controller, control/status, registers, PC, IR, R0, R1; program in memory:
100 load R0, M[500]
101 inc R1, R0
102 store M[501], R1
with M[500] = 10.]
• PC: program counter, always points to the next instruction
[Figure: processor (controller, control/status, registers, IR, ALU) and the program at addresses 100-102; M[500] = 10.]
[Figure: Fetch. PC = 100; the IR now holds "load R0, M[500]".]
[Figure: Fetch operands. The value 10 is read from M[500] into R0.]
Execute (ALU)
• This particular instruction does nothing during this sub-operation.
[Figure: PC = 100, IR = "load R0, M[500]", R0 = 10.]
Store results (memory)
• This particular instruction does nothing during this sub-operation.
[Figure: PC = 100, IR = "load R0, M[500]", R0 = 10.]
Instruction Cycles
[Figure: PC = 100; "load R0, M[500]" passes through the Fetch, Decode, Fetch ops, Execute, and Store results sub-operations over successive clock cycles; R0 = 10.]
[Figure: PC = 101; "inc R1, R0" passes through the same sub-operations; R1 = 11.]
[Figure: PC = 102; "store M[501], R1" passes through the same sub-operations; M[501] = 11.]
Program: 100 load R0, M[500]; 101 inc R1, R0; 102 store M[501], R1; with M[500] = 10.
Architectural Considerations
• N-bit processor: N-bit ALU, registers, and buses; N is typically 8, 16, 32, or even 64
• PC size determines the address space
[Figure: processor (control unit, datapath, PC, IR), I/O, memory.]
Architectural Considerations
• Clock frequency: inverse of the clock period
• The clock period must be longer than the longest register-to-register delay in the processor
[Figure: processor (control unit, datapath, PC, IR), I/O, memory.]
DSP Architecture

ARM Architecture
CISC Feature
• Complex instruction set computer
• Large number of instructions (roughly 200-300)
• Specialized complex instructions
• Many different addressing modes
• Variable length instruction format
• Examples: 68000, 80x86
RISC Feature
• Reduced instruction set computer
• Relatively few instructions (~50)
• Basic instructions
• Relatively few different addressing modes
• Fixed length instruction format
• Only load/store instructions can access memory
• Large number of registers
• Examples: MIPS, Alpha, ARM, etc.
CISC vs RISC
• CISC -- High Code Density
• Fewer instructions needed to specify the algorithm
• RISC -- Simpler to Design
• Higher Performance
• Lower power consumption
• Easier to develop compilers to take advantage of all features
CISC
• Intel: 80x86
• Motorola: 680x0
RISC
• Sun : Sparc
• Silicon Graphics : MIPS
• HP : PA-RISC
• IBM: PowerPC
• Compaq: Alpha
DMA
• Direct Memory Access: a bus operation not controlled by the CPU
• Controlled by a DMA controller (a bus master)
• Two additional wires: bus request and bus grant
[Figure: the DMAC, attached to a device, raises bus request to the CPU and receives bus grant; DMAC, CPU, and memory share the system bus.]
PIPELINE
Pipelining
Laundry example: A, B, C, and D each have one load of clothes to wash, dry, and fold.
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes
What Is Pipelining
[Figure: sequential laundry. Each load runs wash (30 min), dry (40 min), fold (20 min) back to back; tasks A to D in order, starting at 6 PM, take 6 hours.]
What Is Pipelining
Start work as soon as possible:
[Figure: pipelined laundry. The washer, dryer, and folder operate on different loads at the same time; pipelined laundry takes 3.5 hours for the 4 loads.]
Pipelining
• A technique of decomposing a sequential process into sub-operations, with each sub-process executed in a special dedicated segment that operates concurrently with all other segments.
PIPELINE SPEEDUP
• Consider a non-pipelined unit that performs the same operation and takes a time tn to complete each task. A k-segment pipeline with clock cycle tp completes n tasks in k + (n - 1) clock cycles, so

Sk = (n * tn) / ((k + n - 1) * tp)

• As the number of tasks grows:

lim (n -> infinity) Sk = tn / tp   ( = k, if tn = k * tp )
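A quick numeric check of the formula (an illustrative script; k = 4 and tp = 25 ns are arbitrary values chosen so that tn = k * tp):

```python
def speedup(n, k, tn, tp):
    """Speedup of a k-segment pipeline over a non-pipelined unit."""
    return (n * tn) / ((k + n - 1) * tp)

# With tn = k * tp, the speedup approaches k = 4 as n grows:
for n in (10, 100, 10000):
    print(n, round(speedup(n, k=4, tn=100, tp=25), 3))
```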
• There are two areas of computer design where pipeline organization is applicable:
i. Arithmetic pipeline:
divides an arithmetic operation into sub-operations for execution in the pipeline.
ii. Instruction pipeline:
operates on a stream of instructions by overlapping the fetch, decode, and execute phases of the instruction cycle.
ARITHMETIC PIPELINE
• The arithmetic pipeline is used to implement floating-point operations and multiplication of fixed-point numbers.
• Let's take an example of a floating-point operation where the operation is decomposed into sub-operations.
• Suppose the time delays of the 4 segments are t1 = 60 ns, t2 = 70 ns, t3 = 100 ns, t4 = 80 ns, and the interface registers have tr = 10 ns.
• The clock cycle is chosen as tp = t3 + tr = 100 + 10 = 110 ns.
• An equivalent non-pipelined floating-point adder-subtractor would have a delay time of tn = t1 + t2 + t3 + t4 + tr = 320 ns.
• The speedup is 320/110 = 2.9 over the non-pipelined adder.
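These numbers can be checked directly (a short sketch reproducing the slide's arithmetic):

```python
t = [60, 70, 100, 80]  # segment delays in ns
tr = 10                # interface register delay in ns

tp = max(t) + tr       # pipeline clock: slowest segment + register
tn = sum(t) + tr       # equivalent non-pipelined delay
print(tp, tn, round(tn / tp, 1))  # 110 320 2.9
```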
INSTRUCTION PIPELINE
• Pipelining occurs not only in the data stream but in the instruction stream as well.
• An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.
• A problem occurs with branch instructions: the pipeline must be emptied, and the instructions fetched after the branch must be discarded.
• Certain difficulties prevent the pipeline from operating at its maximum rate:
• Different segments may take different times to operate on the incoming information.
• Two or more segments may require memory access at the same time, causing one segment to wait until another is finished with the memory.
• This can be resolved by using two memory buses to access instructions and data in separate memories.
• The pipeline design is most efficient when the instruction cycle is divided into segments of equal duration.
4-STAGE PIPELINE
1. FI: Fetch an instruction from memory
2. DA: Decode the instruction and calculate the effective address of the operand
3. FO: Fetch the operand
4. EX: Execute the operation and store the result
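The stage occupancy of this 4-stage pipeline can be tabulated as a space-time diagram (an illustrative sketch, assuming one clock per stage and no stalls):

```python
STAGES = ["FI", "DA", "FO", "EX"]

def timing(num_instructions):
    """For each instruction, map clock cycle -> stage occupied (no stalls)."""
    rows = []
    for i in range(num_instructions):
        # instruction i is in stage s during clock cycle i + s + 1
        rows.append({i + s + 1: STAGES[s] for s in range(len(STAGES))})
    return rows

for i, row in enumerate(timing(4), start=1):
    print(f"I{i}: " + " ".join(row.get(c, "--") for c in range(1, 8)))
```

Four instructions complete in k + (n - 1) = 4 + 3 = 7 clock cycles instead of the 16 a non-pipelined unit would need.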
TIMING OF INSTRUCTION PIPELINE
PIPELINE HAZARDS
• STRUCTURAL HAZARD
• DATA HAZARD
• CONTROL HAZARD
STRUCTURAL HAZARD
• Occurs when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.
• Example: with one memory port, a data fetch and an instruction fetch cannot be initiated in the same clock cycle.
DATA HAZARD
• An instruction scheduled for execution in the pipeline requires the result of a previous instruction, which has not yet arrived.
• E.g. ADD R1, R2, R3
       SUB R4, R1, R5
• Data hazards can be dealt with by either hardware techniques or software techniques.
HARDWARE TECHNIQUES
Interlock
• Hardware detects the data dependency and delays the scheduling of the dependent instruction by stalling for enough clock cycles.
Forwarding (bypassing, short-circuiting)
• Accomplished by a datapath that routes a value from a source (usually an ALU) to a user, bypassing a designated register. This allows a value to be used at an earlier stage in the pipeline than would otherwise be possible.
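Interlock-style detection of the RAW dependency in the ADD/SUB example above can be sketched as follows (an illustrative model, not a hardware description; instructions are encoded as hypothetical (dest, src1, src2) tuples):

```python
def raw_hazard(producer, consumer):
    """True if the consumer reads a register the producer has yet to write."""
    dest = producer[0]
    return dest in consumer[1:]

add = ("R1", "R2", "R3")  # ADD R1, R2, R3
sub = ("R4", "R1", "R5")  # SUB R4, R1, R5
print(raw_hazard(add, sub))  # True: SUB needs R1 before ADD writes it
```

On detecting such a hazard, the interlock would stall SUB (or the forwarding path would route the ALU result directly) until R1 is available.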
FORWARDING HARDWARE

INSTRUCTION SCHEDULING

CONTROL HAZARDS