Memory Design (Module 4A): Material I, 18-Dec-2019
MEMORY DESIGN
Module 4
Memory Design
• Available memory chip size: N × W (N words, W bits per word)
• Required memory size: N1 × W1
• p = N1 / N, q = W1 / W; p × q chips are required
Memory design
There are three types of organizations of N1 × W1 that can
be formed using N × W chips:
• N1 = N and W1 > W => increasing the word size of the
chip
• N1 > N and W1 = W => increasing the number of
words in the memory
• N1 > N and W1 > W => increasing both the number of
words and number of bits in each word.
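The chip-count arithmetic above can be sketched as a small helper (an illustrative function, not part of the slides; the names are assumptions):

```python
def chips_required(N, W, N1, W1):
    """Chips of size N x W needed to build an N1 x W1 memory."""
    assert N1 % N == 0 and W1 % W == 0, "sizes must divide evenly"
    p = N1 // N   # rows of chips: multiplies the number of words
    q = W1 // W   # chips per row: widens each word
    return p, q, p * q

print(chips_required(128, 4, 128, 16))  # word-size increase: (1, 4, 4)
print(chips_required(256, 8, 1024, 8))  # word-count increase: (4, 1, 4)
```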
Memory design – Increasing the word size
• Problem - 1
• Design 128 × 16 - bit RAM using 128 × 4 - bit RAM
• Solution: p = 128 / 128 = 1; q = 16 / 4 = 4
• Therefore, p × q = 1 × 4 = 4 memory chips of size 128 × 4 are required to
construct 128 × 16 bit RAM
[Address map: RAM 1.1 spans 0000 to 007F; address bits A6-A0 select the word within every chip.]
[Figure: four 128 × 4 chips share the address bus and a common read/write control; each chip contributes 4 data bits (Data 0-3), concatenated onto the 16-bit data bus.]
S.No  Memory  N × W    N1 × W1   p  q  p×q  x  y  z  Total
1     RAM     256 × 8  1024 × 8  4  1  4    8  2  0  10

(x = address bits per chip, y = decoder bits selecting the row of chips, z = additional select bits; Total = address bits required.)
Address map: RAM 1 spans 0000 to 00FF.
[Figure: four 256 × 8 RAM chips share an 8-bit address bus (A7-A0) and an 8-bit data bus; a 2 × 4 decoder driven by A9 and A8 drives the chip select (CS) of each RAM; the R/W line is common to all chips.]
[Figure: logical diagram of the same 1024 × 8 design: address bits 9-0; a 2 × 4 decoder selects RAM 1 to RAM 4 (256 × 8 each); 8-bit data lines.]
Memory Design
• Problem - 3
• Design 256 × 16 – bit RAM using 128 × 8 – bit RAM chips
S.No  Memory  N × W    N1 × W1   p  q  p×q  x  y  z  Total
1     RAM     128 × 8  256 × 16  2  2  4    7  1  0  8
[Figure: four 128 × 8 chips arranged in two rows of two. RAM 1.1/RAM 1.2 form the first 128 words, RAM 2.1/RAM 2.2 the second; a 1 × 2 decoder (outputs 1 and 0) selects the row via chip select; each row's two chips each supply 8 bits onto the 16-bit data bus.]
Memory Design
• Problem - 4
• Design 256 × 16 – bit RAM using 256 × 8 – bit RAM chips and
256 × 8 – bit ROM using 128 × 8 – bit ROM chips.
S.No  Memory  N × W    N1 × W1   p  q  p×q  x  y  z  Total
1     RAM     256 × 8  256 × 16  1  2  2    8  0  1  9
2     ROM     128 × 8  256 × 8   2  1  2    7  1  1  9

(z = 1 select bit distinguishes the RAM region from the ROM region.)
[Figure: chip layout. A select bit feeds two 1 × 2 decoders; 256 × 8 RAM 1.1 and RAM 1.2 together supply the 16-bit RAM word, while the 128 × 8 ROM chips are row-selected to form the 256 × 8 ROM; data paths are 8 bits per chip.]
Memory design
Problem – 5
A computer employs RAM chips of 128 x 8 and ROM chips of
512 x 8. The computer system needs 256 bytes of RAM, 1024 x
16 of ROM, and two interface units with 256 registers each. A
memory-mapped I/O configuration is used. The two higher-order
bits of the address bus are assigned 00 for RAM, 01 for ROM,
and 10 for interface registers.
a) Compute the total number of decoders needed for the above
system
b) Design a memory-address map for the above system
c) Show the chip layout for the above design
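The region decode in this problem can be sketched as follows (a hypothetical helper, not from the slides; it assumes a 16-bit address bus, consistent with the 15-0 address-map columns used elsewhere in this deck):

```python
def region(addr):
    """Map a 16-bit address to its region by its two high-order bits:
    00 -> RAM, 01 -> ROM, 10 -> interface registers."""
    top2 = (addr >> 14) & 0b11
    return {0b00: "RAM", 0b01: "ROM", 0b10: "interface"}.get(top2, "unused")

print(region(0x0000))  # RAM
print(region(0x4000))  # ROM
print(region(0x8000))  # interface
```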
Requirements
[Table: worksheet with columns S.No, Memory, N × W, N1 × W1, p, q, p×q, x, y, z, Total.]
[Memory-address map (partial): RAM 1 occupies 0000 to 007F; high-order address bits are 0, and bits A6-A0 select the word.]
CACHE IN EMBEDDED SYSTEM
CACHE MEMORY
Processor <-> Cache <-> Main Memory
• Cache: fast/expensive technology, usually on the same chip as the processor
• Main memory: slower/cheaper technology, usually on a different chip
Types of Locality
• Temporal locality: recently accessed items are likely to be accessed again in the near future.
• Spatial locality: items whose addresses are near one another tend to be referenced close together in time.
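Spatial locality can be demonstrated by traversing a 2-D array in row-major versus column-major order (an illustrative experiment, not from the slides; the effect is far more dramatic in compiled languages than in Python, and exact timings are machine-dependent):

```python
import time

N = 1000
a = [[0] * N for _ in range(N)]

def traverse(row_major):
    s = 0
    for i in range(N):
        for j in range(N):
            # row-major touches adjacent elements; column-major jumps
            # a whole row between consecutive accesses
            s += a[i][j] if row_major else a[j][i]
    return s

t0 = time.perf_counter(); traverse(True);  t_row = time.perf_counter() - t0
t0 = time.perf_counter(); traverse(False); t_col = time.perf_counter() - t0
print(f"row-major {t_row:.3f}s, column-major {t_col:.3f}s")
```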
[Figure: cache read flowchart. Start; on a hit, deliver the block to the CPU; on a miss, select the cache line to receive the block from main memory, fetch it, then deliver the block to the CPU; done.]
Cache Memory Management Techniques
• Block Management (placement): Direct Mapping, Set Associative, Fully Associative
• Block Identification: Tag, Block Index, Offset
• Block Replacement: FCFS, LRU, Random
• Update Policies: Write Through, Write Back, Write Around, Write Allocate
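The block-identification fields can be sketched as follows (the cache geometry here, 8 lines of 4 bytes, is an assumption chosen for illustration):

```python
OFFSET_BITS = 2  # 4-byte blocks
INDEX_BITS = 3   # 8 cache lines

def split_address(addr):
    """Split an address into (tag, index, offset) for a direct-mapped cache."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0b1101_101_10))  # (13, 5, 2): tag 13, line 5, byte 2
```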
Example
Direct Mapping
[Figure: main-memory blocks 0-15 mapping onto cache lines 0-7.]
Cache line = (MM block address) mod (number of lines in the cache)
Example: 12 mod 8 = 4, so main-memory block 12 maps to cache line 4.
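The mapping rule above is a single modulo operation; a one-line sketch of the slide's example:

```python
def cache_line(block_address, num_lines=8):
    """Direct mapping: a main-memory block maps to line (block mod lines)."""
    return block_address % num_lines

print(cache_line(12))  # 4, matching the example above
```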
DIRECT MAPPING
EXAMPLE
SPATIAL LOCALITY
BLOCK ADDRESS
CACHE MAPPING
USING ARITHMETIC
LARGER CACHE
EXAMPLE
EXAMPLE
BLOCK REPLACEMENT
Cache-replacement policy
• Technique for choosing which block to replace
• when fully associative cache is full
• when set-associative cache’s line is full
• Direct mapped cache has no choice
• Random
• replace block chosen at random
• LRU: least-recently used
• replace block not accessed for longest time
• FIFO: first-in first-out
• push a block onto the queue when it is brought into the cache
• choose the block to replace by popping the queue
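The LRU and FIFO policies can be sketched on a small fully associative cache (an illustrative simulation, not from the slides; the reference string and cache size are arbitrary):

```python
from collections import OrderedDict, deque

def lru_misses(refs, size=3):
    cache, misses = OrderedDict(), 0
    for b in refs:
        if b in cache:
            cache.move_to_end(b)            # refresh as most recently used
        else:
            misses += 1
            if len(cache) == size:
                cache.popitem(last=False)   # evict least recently used
            cache[b] = True
    return misses

def fifo_misses(refs, size=3):
    cache, order, misses = set(), deque(), 0
    for b in refs:
        if b not in cache:
            misses += 1
            if len(cache) == size:
                cache.discard(order.popleft())  # evict oldest arrival
            cache.add(b)
            order.append(b)
    return misses

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print("LRU misses:", lru_misses(refs))    # 10
print("FIFO misses:", fifo_misses(refs))  # 9
```

Note that neither policy dominates the other: on this particular reference string FIFO happens to miss less often than LRU.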
[Figure: cache miss rate (0 to 0.16) versus cache size (1 KB to 128 KB) for 1-, 2-, 4-, and 8-way set-associative caches; the miss rate falls as cache size and associativity increase.]
Comparison of Cache Mapping Techniques
• There is a critical trade-off in cache performance that has led to
the creation of the various cache mapping techniques described
in the previous section. In order for the cache to have good
performance you want to maximize both of the following:
• Hit Ratio: You want to increase as much as possible the
likelihood of the cache containing the memory addresses that
the processor wants. Otherwise, you lose much of the benefit of
caching because there will be too many misses.
• Search Speed: You want to be able to determine as quickly as
possible if you have scored a hit in the cache. Otherwise, you
lose a small amount of time on every access, hit or miss, while
you search the cache.
Comparison of Cache Mapping Techniques (contd.)

Cache Type                   Hit Ratio                         Search Speed
Direct Mapped                Good                              Best
Fully Associative            Best                              Moderate
N-Way Set Associative (N>1)  Very good, better as N increases  Good, worse as N increases
BASIC ARCHITECTURE
• Processor = control unit + datapath
• Datapath operations:
• Load: read a memory location into a register
• ALU operation: input certain registers through the ALU, store the result back in a register
[Figure: processor (control unit; datapath with ALU, controller, control/status, registers) connected through I/O to memory holding the values 10 and 11.]
Control Unit
• Control unit: configures the datapath operations
• Sequence of desired operations ("instructions") stored in memory: the "program"
• Instruction cycle: broken into several sub-operations, each taking one clock cycle, e.g.:
• Fetch: get the next instruction into the IR
• Decode: determine what the instruction means
• Fetch operands: move data from memory to a datapath register
• Execute: move data through the ALU
• Store results: write data from a register to memory
[Figure: processor with controller, control/status, registers, PC, IR, R0, R1; program in memory:
100 load R0, M[500]
101 inc R1, R0
102 store M[501], R1
with M[500] = 10.]
• PC: program counter, always points to the next instruction
[Figure: processor (controller, control/status, registers, IR, ALU) and the program at addresses 100-102; M[500] = 10.]
[Figure: Fetch. PC = 100; the IR now holds "load R0, M[500]".]
[Figure: Fetch operands. The value 10 is read from M[500] into R0.]
Execute (ALU)
• This particular instruction does nothing during this sub-operation.
[Figure: PC = 100, IR = "load R0, M[500]", R0 = 10.]
Store results (memory)
• This particular instruction does nothing during this sub-operation.
[Figure: PC = 100, IR = "load R0, M[500]", R0 = 10.]
Instruction Cycles
[Figure: PC = 100; "load R0, M[500]" passes through the Fetch, Decode, Fetch ops, Execute, and Store results sub-operations over successive clock cycles; R0 = 10.]
[Figure: PC = 101; "inc R1, R0" passes through the same sub-operations; R1 = 11.]
[Figure: PC = 102; "store M[501], R1" passes through the same sub-operations; M[501] = 11.]
Program: 100 load R0, M[500]; 101 inc R1, R0; 102 store M[501], R1; with M[500] = 10.
Architectural Considerations
• N-bit processor: N-bit ALU, registers, and buses; N is typically 8, 16, 32, or even 64
• PC size determines the address space
[Figure: processor (control unit, datapath, PC, IR), I/O, memory.]
Architectural Considerations
• Clock frequency: inverse of the clock period
• The clock period must be longer than the longest register-to-register delay in the processor
[Figure: processor (control unit, datapath, PC, IR), I/O, memory.]
DSP Architecture

ARM Architecture
CISC Feature
• Complex instruction set computer
• Large number of instructions (roughly 200-300)
• Specialized complex instructions
• Many different addressing modes
• Variable length instruction format
• Examples: 68000, 80x86
RISC Feature
• Reduced instruction set computer
• Relatively few instructions (~50)
• Basic instructions
• Relatively few different addressing modes
• Fixed length instruction format
• Only load/store instructions can access memory
• Large number of registers
• Examples: MIPS, Alpha, ARM, etc.
CISC vs RISC
• CISC -- High Code Density
• Fewer instructions needed to specify the algorithm
• RISC -- Simpler to Design
• Higher Performance
• Lower power consumption
• Easier to develop compilers to take advantage of all features
CISC
• Intel: 80x86
• Motorola: 680x0
RISC
• Sun : Sparc
• Silicon Graphics : MIPS
• HP : PA-RISC
• IBM: PowerPC
• Compaq: Alpha
DMA
• Direct Memory Access: a bus operation not controlled by the CPU
• Controlled by a DMA controller (a bus master)
• Two additional wires: bus request and bus grant
[Figure: the DMAC, attached to a device, raises bus request to the CPU and receives bus grant; DMAC, CPU, and memory share the system bus.]
PIPELINE
Pipelining
Laundry example: A, B, C, and D each have one load of clothes to wash, dry, and fold.
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes
What Is Pipelining
[Figure: sequential laundry. Each load runs wash (30 min), dry (40 min), fold (20 min) back to back; tasks A to D in order, starting at 6 PM, take 6 hours.]
What Is Pipelining
Start work as soon as possible:
[Figure: pipelined laundry. The washer, dryer, and folder operate on different loads at the same time; pipelined laundry takes 3.5 hours for the 4 loads.]
Pipelining
• A technique of decomposing a sequential process into sub-operations, with each sub-process executed in a special dedicated segment that operates concurrently with all other segments.
PIPELINE SPEEDUP
• Consider a non-pipelined unit that performs the same operation and takes a time tn to complete each task. A k-segment pipeline with clock cycle tp completes n tasks in k + (n - 1) clock cycles, so

Sk = (n * tn) / ((k + n - 1) * tp)

• As the number of tasks grows:

lim (n -> infinity) Sk = tn / tp   ( = k, if tn = k * tp )
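A quick numeric check of the formula (an illustrative script; k = 4 and tp = 25 ns are arbitrary values chosen so that tn = k * tp):

```python
def speedup(n, k, tn, tp):
    """Speedup of a k-segment pipeline over a non-pipelined unit."""
    return (n * tn) / ((k + n - 1) * tp)

# With tn = k * tp, the speedup approaches k = 4 as n grows:
for n in (10, 100, 10000):
    print(n, round(speedup(n, k=4, tn=100, tp=25), 3))
```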
• There are two areas of computer design where pipeline organization is applicable:
i. Arithmetic pipeline:
divides an arithmetic operation into sub-operations for execution in the pipeline.
ii. Instruction pipeline:
operates on a stream of instructions by overlapping the fetch, decode, and execute phases of the instruction cycle.
ARITHMETIC PIPELINE
• The arithmetic pipeline is used to implement floating-point operations and multiplication of fixed-point numbers.
• Let's take an example of a floating-point operation where the operation is decomposed into sub-operations.
• Suppose the time delays of the 4 segments are t1 = 60 ns, t2 = 70 ns, t3 = 100 ns, t4 = 80 ns, and the interface registers have tr = 10 ns.
• The clock cycle is chosen as tp = t3 + tr = 100 + 10 = 110 ns.
• An equivalent non-pipelined floating-point adder-subtractor would have a delay time of tn = t1 + t2 + t3 + t4 + tr = 320 ns.
• The speedup is 320/110 = 2.9 over the non-pipelined adder.
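These numbers can be checked directly (a short sketch reproducing the slide's arithmetic):

```python
t = [60, 70, 100, 80]  # segment delays in ns
tr = 10                # interface register delay in ns

tp = max(t) + tr       # pipeline clock: slowest segment + register
tn = sum(t) + tr       # equivalent non-pipelined delay
print(tp, tn, round(tn / tp, 1))  # 110 320 2.9
```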
INSTRUCTION PIPELINE
• Pipelining occurs not only in the data stream but in the instruction stream as well.
• An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.
• A problem occurs with branch instructions: the pipeline must be emptied, and the instructions fetched after the branch must be discarded.
• Certain difficulties prevent the pipeline from operating at its maximum rate:
• Different segments may take different times to operate on the incoming information.
• Two or more segments may require memory access at the same time, causing one segment to wait until another is finished with the memory.
• This can be resolved by using two memory buses to access instructions and data in separate memories.
• The pipeline design is most efficient when the instruction cycle is divided into segments of equal duration.
4-STAGE PIPELINE
1. FI: Fetch an instruction from memory
2. DA: Decode the instruction and calculate the effective address of the operand
3. FO: Fetch the operand
4. EX: Execute the operation and store the result
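The stage occupancy of this 4-stage pipeline can be tabulated as a space-time diagram (an illustrative sketch, assuming one clock per stage and no stalls):

```python
STAGES = ["FI", "DA", "FO", "EX"]

def timing(num_instructions):
    """For each instruction, map clock cycle -> stage occupied (no stalls)."""
    rows = []
    for i in range(num_instructions):
        # instruction i is in stage s during clock cycle i + s + 1
        rows.append({i + s + 1: STAGES[s] for s in range(len(STAGES))})
    return rows

for i, row in enumerate(timing(4), start=1):
    print(f"I{i}: " + " ".join(row.get(c, "--") for c in range(1, 8)))
```

Four instructions complete in k + (n - 1) = 4 + 3 = 7 clock cycles instead of the 16 a non-pipelined unit would need.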
TIMING OF INSTRUCTION PIPELINE
PIPELINE HAZARDS
• STRUCTURAL HAZARD
• DATA HAZARD
• CONTROL HAZARD
STRUCTURAL HAZARD
• Occurs when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.
• Example: with one memory port, a data fetch and an instruction fetch cannot be initiated in the same clock cycle.
DATA HAZARD
• An instruction scheduled for execution in the pipeline requires the result of a previous instruction, which has not yet arrived.
• E.g. ADD R1, R2, R3
       SUB R4, R1, R5
• Data hazards can be dealt with by either hardware techniques or software techniques.
HARDWARE TECHNIQUES
Interlock
• Hardware detects the data dependency and delays the scheduling of the dependent instruction by stalling for enough clock cycles.
Forwarding (bypassing, short-circuiting)
• Accomplished by a datapath that routes a value from a source (usually an ALU) to a user, bypassing a designated register. This allows a value to be used at an earlier stage in the pipeline than would otherwise be possible.
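Interlock-style detection of the RAW dependency in the ADD/SUB example above can be sketched as follows (an illustrative model, not a hardware description; instructions are encoded as hypothetical (dest, src1, src2) tuples):

```python
def raw_hazard(producer, consumer):
    """True if the consumer reads a register the producer has yet to write."""
    dest = producer[0]
    return dest in consumer[1:]

add = ("R1", "R2", "R3")  # ADD R1, R2, R3
sub = ("R4", "R1", "R5")  # SUB R4, R1, R5
print(raw_hazard(add, sub))  # True: SUB needs R1 before ADD writes it
```

On detecting such a hazard, the interlock would stall SUB (or the forwarding path would route the ALU result directly) until R1 is available.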
FORWARDING HARDWARE

INSTRUCTION SCHEDULING

CONTROL HAZARDS