
ECE 554 Computer Architecture

Lecture 5
Main Memory
Spring 2013
Sudeep Pasricha
Department of Electrical and Computer Engineering
Colorado State University
Slides: Pasricha; portions from Kubiatowicz, Patterson, Mutlu, Binkert, and Elsevier

Main Memory Background

Performance of main memory:
  Latency: determines the cache miss penalty
    Access time: time between the request and the arrival of the word
    Cycle time: minimum time between successive requests
  Bandwidth: determines I/O throughput and the miss penalty for large (L2) blocks

Main memory is DRAM: Dynamic Random Access Memory
  "Dynamic" because it must be refreshed periodically (every 8 ms; roughly 1% of time is spent refreshing)
  Addresses are divided into 2 halves (memory as a 2D matrix):
    RAS, or Row Address Strobe
    CAS, or Column Address Strobe

Cache uses SRAM: Static Random Access Memory
  No refresh needed (6 transistors per bit vs. 1 transistor + 1 capacitor per bit)
  Size: DRAM/SRAM is about 4-8x (DRAM is denser)
  Cycle time: DRAM/SRAM is about 8-16x (DRAM is slower)
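To see how a figure like "1% of time" can arise from an 8 ms refresh window, here is a back-of-the-envelope sketch in C. The row count and per-row refresh time are illustrative assumptions; only the 8 ms window comes from the slide.

/* Rough estimate of DRAM refresh overhead.
 * Assumed (illustrative) parameters: every row is refreshed once per
 * 8 ms window, and one row refresh is modeled as one row cycle. */
#include <stdio.h>

int main(void) {
    double refresh_window_s = 8e-3;   /* every cell refreshed within 8 ms (from the slide) */
    double rows             = 1024;   /* assumed number of rows needing refresh            */
    double row_cycle_s      = 80e-9;  /* assumed time to refresh one row (~80 ns)          */

    double busy_s   = rows * row_cycle_s;
    double overhead = busy_s / refresh_window_s;

    printf("time spent refreshing: %.1f us per %.0f ms window (%.1f%% overhead)\n",
           busy_s * 1e6, refresh_window_s * 1e3, overhead * 100.0);
    return 0;
}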

Memory subsystem organization

The memory subsystem is organized as a hierarchy:
  Channel
  DIMM
  Rank
  Chip
  Bank
  Row/Column

Memory subsystem

[Figure: a processor connected over one or more memory channels to DIMMs (dual in-line memory modules)]
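One way to make this hierarchy concrete is to show how a physical address could be split into fields that select each level. The field widths and their ordering in the sketch below are assumptions chosen for illustration; real controllers use their own (often interleaved) mappings.

/* Illustrative decomposition of a physical address into
 * channel / DIMM / rank / bank / row / column fields.
 * Field widths and ordering are assumptions for this sketch. */
#include <stdint.h>
#include <stdio.h>

struct dram_loc {
    unsigned channel, dimm, rank, bank;
    unsigned row, col;
};

static struct dram_loc decode(uint64_t paddr) {
    struct dram_loc loc;
    loc.col     = (paddr >>  0) & 0x7FF;   /* 11 column bits (assumed) */
    loc.bank    = (paddr >> 11) & 0x7;     /*  3 bank bits   (assumed) */
    loc.rank    = (paddr >> 14) & 0x1;     /*  1 rank bit    (assumed) */
    loc.dimm    = (paddr >> 15) & 0x1;     /*  1 DIMM bit    (assumed) */
    loc.channel = (paddr >> 16) & 0x1;     /*  1 channel bit (assumed) */
    loc.row     = (unsigned)(paddr >> 17); /* remaining bits select the row */
    return loc;
}

int main(void) {
    struct dram_loc l = decode(0x12345678ULL);
    printf("channel %u, DIMM %u, rank %u, bank %u, row %u, col %u\n",
           l.channel, l.dimm, l.rank, l.bank, l.row, l.col);
    return 0;
}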

Breaking down a DIMM

DIMM (dual in-line memory module)

[Figure: side, front, and back views of a DIMM]

Serial presence detect (SPD)
  stored in an EEPROM on the module
  holds the information needed to configure the memory controller

Rank 0: the collection of 8 chips on the front of the DIMM
Rank 1: the 8 chips on the back of the DIMM
Both ranks sit on the same memory channel and share Addr/Cmd and Data <0:63>; chip selects CS <0:1> pick the rank

DIMM & Rank (from JEDEC)

Breaking down a Rank

[Figure: Rank 0 consists of eight chips (Chip 0 ... Chip 7) that together drive Data <0:63>; each chip supplies 8 of the 64 data bits, e.g. Chip 0 drives <0:7> and Chip 7 drives <56:63>]

Breaking down a Chip

[Figure: one x8 chip (Chip 0) contains multiple banks (Bank 0, ...), all sharing the chip's 8-bit data interface <0:7>]

Breaking down a Bank

[Figure: Bank 0 is a 2D array of rows (row 0 ... row 16k-1), each row 2 kB wide and made up of 1 B columns; an accessed row is latched into the row buffer, from which 1 B columns are read out over the chip's <0:7> data pins]

Example: Transferring a cache block

[Figure sequence: a 64 B cache block, occupying physical addresses 0x00 up to 0x40, maps to Channel 0, DIMM 0, Rank 0; the rank's eight chips (Chip 0 ... Chip 7) each drive 8 bits of Data <0:63>, so every I/O cycle reads one column of Row 0 from each chip and returns 8 B of the block; stepping through Col 0, Col 1, ... transfers the block 8 B at a time]

A 64 B cache block takes 8 I/O cycles to transfer.
During the process, 8 columns are read sequentially.
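The arithmetic behind the 8-cycle figure follows directly from the organization in the figure: eight x8 chips form a 64-bit channel, so each I/O cycle moves 8 B. A minimal C sketch of that calculation:

/* Why a 64 B cache block needs 8 transfers on this organization:
 * each of the rank's 8 x8 chips drives 8 bits, so one transfer
 * ("beat") moves 64 bits = 8 bytes across Data<0:63>. */
#include <stdio.h>

int main(void) {
    int chips_per_rank = 8;    /* Chip 0 .. Chip 7            */
    int bits_per_chip  = 8;    /* each chip drives 8 data pins */
    int cache_block_B  = 64;

    int channel_bits   = chips_per_rank * bits_per_chip;  /* 64-bit data bus   */
    int bytes_per_beat = channel_bits / 8;                 /* 8 B per I/O cycle */
    int beats          = cache_block_B / bytes_per_beat;   /* 64 / 8 = 8        */

    printf("%d-bit channel -> %d B per beat -> %d beats (columns) per %d B block\n",
           channel_bits, bytes_per_beat, beats, cache_block_B);
    return 0;
}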

DRAM Overview


DRAM Architecture

[Figure: a DRAM array with rows (Row 1 ... Row 2^N) selected by the row address decoder driving word lines, and columns (Col. 1 ... Col. 2^M) running along bit lines into the column decoder and sense amplifiers, which deliver the data; each memory cell stores one bit, and the address is N+M bits wide]

Bits are stored in 2-dimensional arrays on the chip

Modern chips have around 4 logical banks on each chip
  each logical bank is physically implemented as many smaller arrays

1-T Memory Cell (DRAM)

[Figure: a one-transistor cell with a row-select word line, a bit line, and a storage capacitor]

Write:
1. Drive the bit line
2. Select the row

Read:
1. Precharge the bit line to Vdd/2
2. Select the row
3. The storage cell shares its charge with the bit line
   (only a very small voltage change appears on the bit line)
4. Sense the change (with a fancy sense amp)
   (can detect changes of ~1 million electrons)
5. Write: restore the value, since the read disturbed the stored charge

Refresh:
1. Just do a dummy read to every cell

SRAM vs. DRAM

DRAM Operation: Three Steps

Precharge
  charges the bit lines to a known value; required before the next row access

Row access (RAS)
  decode the row address and enable the addressed row (often multiple Kb in a row)
  contents of the storage cells share charge with the bit lines
  the small change in voltage is detected by the sense amplifiers, which latch the whole row of bits
  the sense amplifiers then drive the bit lines full rail to recharge the storage cells

Column access (CAS)
  decode the column address to select a small number of sense-amplifier latches (4, 8, 16, or 32 bits depending on the DRAM package)
  on a read, send the latched bits out to the chip pins
  on a write, change the sense-amplifier latches, which then charge the storage cells to the required value
  multiple column accesses can be performed on the same row without another row access (burst mode)
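Putting rough numbers on the three steps shows why reusing an open row matters. The nanosecond values in this sketch are assumed, round figures, not parameters of any specific device:

/* Back-of-the-envelope latency for one isolated DRAM read:
 * precharge + row access (RAS) + column access (CAS).
 * The nanosecond values are assumed, round numbers for illustration. */
#include <stdio.h>

int main(void) {
    double t_precharge_ns = 15.0;  /* assumed: restore bit lines         */
    double t_row_ns       = 15.0;  /* assumed: RAS, latch the row        */
    double t_col_ns       = 15.0;  /* assumed: CAS, drive data to pins   */

    double closed_row_ns = t_precharge_ns + t_row_ns + t_col_ns;
    double open_row_ns   = t_col_ns;  /* burst mode: same row, CAS only  */

    printf("access needing precharge + RAS + CAS:   %.0f ns\n", closed_row_ns);
    printf("access hitting the open row (CAS only): %.0f ns\n", open_row_ns);
    return 0;
}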

DRAM: Memory-Access Protocol

DRAM Bank Operation

[Figure: a bank with a row decoder selecting a row into the row buffer, and a column mux selecting data out of it]

Access address stream (row buffer initially empty):
  (Row 0, Column 0)
  (Row 0, Column 1)
  (Row 0, Column 85)
  (Row 1, Column 0)

Commands issued:
  ACTIVATE 0     -- open Row 0 into the row buffer
  READ 0
  READ 1         -- row-buffer HIT
  READ 85        -- row-buffer HIT
  PRECHARGE      -- (Row 1, Column 0) is a row-buffer CONFLICT: close Row 0
  ACTIVATE 1     -- open Row 1
  READ 0
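The command sequence above can be reproduced by a simple open-page model: keep the last-activated row open, issue only a READ on a hit, and issue PRECHARGE + ACTIVATE on a conflict. The following sketch is a toy model written for this example, not controller code:

/* Open-page policy: keep the last-activated row in the row buffer.
 * Access to the open row: hit (READ only).
 * Access to a different row: conflict (PRECHARGE + ACTIVATE + READ).
 * Empty row buffer: ACTIVATE + READ. */
#include <stdio.h>

struct access { int row, col; };

int main(void) {
    struct access trace[] = { {0, 0}, {0, 1}, {0, 85}, {1, 0} };  /* from the slide */
    int open_row = -1;   /* -1 means the row buffer is empty */

    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        int row = trace[i].row, col = trace[i].col;
        if (open_row == row) {
            printf("READ %d        (row-buffer hit)\n", col);
        } else {
            if (open_row != -1)
                printf("PRECHARGE      (row-buffer conflict: close row %d)\n", open_row);
            printf("ACTIVATE %d\n", row);
            printf("READ %d\n", col);
            open_row = row;
        }
    }
    return 0;
}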

DRAM: Basic Operation

DRAM Read Timing (Example)

Every DRAM access begins with the assertion of RAS_L
There are 2 ways to read: early or late relative to CAS

[Timing diagram for a 256K x 8 DRAM with RAS_L, CAS_L, WE_L, and OE_L control inputs: the row address and then the column address are multiplexed onto the address pins A; the data pins D stay high-Z until the read access time (early read cycle: OE_L asserted before CAS_L) or the output-enable delay (late read cycle: OE_L asserted after CAS_L) has elapsed and Data Out is driven; the whole sequence defines the DRAM read cycle time]

DRAM: Burst

DRAM: Banks

2Gb x8 DDR3 Chip [Micron]

Observe: bank organization



Quest for DRAM Performance

1. Fast Page mode
   Add timing signals that allow repeated accesses to the row buffer without paying another row access time
   Such a buffer comes naturally, as each array buffers 1024 to 2048 bits for each access

2. Synchronous DRAM (SDRAM)
   Add a clock signal to the DRAM interface, so that repeated transfers do not bear the overhead of synchronizing with the DRAM controller

3. Double Data Rate (DDR SDRAM)
   Transfer data on both the rising and falling edges of the DRAM clock signal, doubling the peak data rate
   DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts and offers higher clock rates: up to 400 MHz
   DDR3 drops to 1.5 volts and raises clock rates to up to 800 MHz
   DDR4 drops to 1-1.2 volts and raises clock rates to up to 1600 MHz
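As a rough check on what those clock rates mean for throughput, the sketch below multiplies each bus clock by two transfers per cycle and by the 64-bit (8 B) DIMM data width used earlier in these slides. It is a peak-bandwidth estimate only:

/* Peak per-DIMM bandwidth implied by the clock rates above:
 * DDR moves data on both clock edges (2 transfers per cycle)
 * over a 64-bit (8 B) DIMM data bus. */
#include <stdio.h>

int main(void) {
    const char *gen[]  = { "DDR2", "DDR3", "DDR4" };
    double clock_mhz[] = { 400, 800, 1600 };   /* max clock rates from the slide */
    double bus_bytes   = 8;                    /* Data<0:63> = 64-bit DIMM bus   */

    for (int i = 0; i < 3; i++) {
        double mt_per_s = 2 * clock_mhz[i];               /* mega-transfers/s */
        double gb_per_s = mt_per_s * bus_bytes / 1000.0;  /* GB/s             */
        printf("%s: %4.0f MHz clock -> %4.0f MT/s -> %.1f GB/s peak\n",
               gen[i], clock_mhz[i], mt_per_s, gb_per_s);
    }
    return 0;
}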

1. Fast Page Mode Operation

Regular DRAM organization:
  N rows x N columns x M bits
  Read and write M bits at a time
  Each M-bit access requires a full RAS / CAS cycle

Fast Page Mode DRAM:
  Adds an N x M SRAM register to save a row
  After a row is read into the register, only a CAS is needed to access other M-bit blocks on that row
  RAS_L remains asserted while CAS_L is toggled

[Timing diagram: after one Row Address on the address pins A with RAS_L held asserted, the 1st, 2nd, 3rd, and 4th M-bit accesses each need only a new Col Address and a CAS_L pulse to produce an M-bit output]
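A small comparison makes the benefit concrete: n accesses to the same row cost a full RAS + CAS cycle each in a regular DRAM, but only one RAS plus n CAS accesses in fast page mode. The latency values below are assumed, illustrative numbers:

/* Regular DRAM: every M-bit access pays a full RAS + CAS cycle.
 * Fast page mode: one RAS opens the row, then each access pays CAS only.
 * t_ras / t_cas are assumed, illustrative values. */
#include <stdio.h>

int main(void) {
    double t_ras_ns = 50.0;   /* assumed row-access portion    */
    double t_cas_ns = 15.0;   /* assumed column-access portion */
    int n = 4;                /* 4 M-bit accesses to the same row, as in the figure */

    double regular_ns = n * (t_ras_ns + t_cas_ns);
    double fpm_ns     = t_ras_ns + n * t_cas_ns;

    printf("regular DRAM:   %.0f ns for %d same-row accesses\n", regular_ns, n);
    printf("fast page mode: %.0f ns for %d same-row accesses\n", fpm_ns, n);
    return 0;
}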

2. SDRAM Timing (Single Data Rate)

[Timing diagram: RAS to a new bank, CAS, CAS latency, burst READ, and precharge phases]

Micron 128 Mbit DRAM (2 Meg x 16 x 4 banks version)
  Address split: row (12 bits), bank (2 bits), column (9 bits)
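The 12/2/9 split can be sanity-checked and decoded in a few lines of C. The widths come from the slide; the ordering of the fields within the address word below is an assumption made for the sketch:

/* Micron 128 Mbit SDRAM (2 Meg x 16 x 4 banks):
 * 12 row bits, 2 bank bits, 9 column bits per 16-bit location.
 * Sanity check: 2^12 rows * 2^9 cols * 4 banks * 16 bits = 128 Mbit. */
#include <stdio.h>

int main(void) {
    unsigned row_bits = 12, bank_bits = 2, col_bits = 9;
    unsigned long long bits =
        (1ULL << row_bits) * (1ULL << col_bits) * (1ULL << bank_bits) * 16;
    printf("capacity: %llu Mbit\n", bits >> 20);   /* prints 128 */

    /* Decode a 23-bit location address; the field order here is assumed. */
    unsigned addr = 0x123456 & 0x7FFFFF;
    unsigned col  =  addr                             & 0x1FF;   /* low 9 bits  */
    unsigned bank = (addr >> col_bits)                & 0x3;     /* next 2 bits */
    unsigned row  = (addr >> (col_bits + bank_bits))  & 0xFFF;   /* top 12 bits */
    printf("addr 0x%06X -> row %u, bank %u, col 0x%03X\n", addr, row, bank, col);
    return 0;
}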

3. Double-Data Rate (DDR2) DRAM

[Timing diagram from the Micron 256 Mb DDR2 SDRAM datasheet: with a 200 MHz clock, a row command, a column command, data transfer, precharge, and the next row command; data moves on both clock edges, giving a 400 Mb/s data rate]

Memory Organizations

Graphics Memory

Achieves 2-5x the bandwidth per DRAM of DDR3
  Wider interfaces (32 vs. 16 bits)
  Higher clock rates
  Possible because the chips are attached by soldering instead of via socketed DIMM modules
  E.g. Samsung GDDR5: 2.5 GHz, 20 GB/s of bandwidth
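The two quoted numbers are consistent with a 32-bit interface moving data on both edges of a 2.5 GHz clock, as the quick check below shows. Treat the data-rate model as an assumption; GDDR5 signaling is more elaborate in practice:

/* Quick consistency check of the GDDR5 numbers quoted above:
 * a 32-bit (4 B) interface moving data on both edges of a 2.5 GHz clock. */
#include <stdio.h>

int main(void) {
    double clock_ghz   = 2.5;        /* from the slide           */
    double iface_bytes = 32 / 8.0;   /* 32-bit device interface  */
    double gb_per_s    = clock_ghz * 2 /* both edges */ * iface_bytes;

    printf("%.1f GHz x 2 x %.0f B = %.0f GB/s per device\n",
           clock_ghz, iface_bytes, gb_per_s);
    return 0;
}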

DRAM Power: Not always up, but ...

DRAM Modules

A 64-bit Wide DIMM (physical view)

A 64-bit Wide DIMM (logical view)

DRAM Ranks

Multiple DIMMs on a Channel

Fully Buffered DIMM (FB-DIMM)

DDR problem:
  higher capacity -> more DIMMs -> lower data rate (multidrop bus)

FB-DIMM approach: use point-to-point links
  introduces an advanced memory buffer (AMB) between the memory controller and the memory module
  serial interface between the memory controller and the AMB
  enables an increase in the width of the memory without increasing the pin count of the memory controller

FB-DIMM challenges

As of Sep 2006, AMD had taken FB-DIMM off their roadmap
In 2007 it was revealed that major memory manufacturers had no plans to extend FB-DIMM to support DDR3 SDRAM
  Instead, only registered DIMMs for DDR3 SDRAM had been demonstrated
  In normal registered/buffered memory, only the control lines are buffered, whereas in fully buffered memory the data lines are buffered as well
  Both FB and registered options increase latency and are costly

Intel Scalable Memory Buffer

DRAM Channels

DRAM Channel Options

Multi-CPU (old school)

NUMA Topology (modern)

Memory Controller

DRAM: Timing Constraints

Latency Components: Basic DRAM Operation

500 MHz DDR = 1000 MT/s (DDR transfers data on both clock edges, so the transfer rate is twice the clock frequency)

DRAM Addressing

DRAM Controller Functionality

A Modern DRAM Controller

Row Buffer Management Policies

DRAM Controller Scheduling Policies (I)

DRAM Controller Scheduling Policies (II)

DRAM Refresh (I)

DRAM Refresh (II)

DRAM Controllers are Difficult to Design

DRAM Power Management

DRAM Reliability

DRAMs are susceptible to both soft and hard errors

Dynamic errors can be
  detected by parity bits
    usually 1 parity bit per 8 bits of data
  detected and fixed by the use of Error Correcting Codes (ECCs)
    e.g. a SECDED Hamming code can detect two errors and correct a single error at a cost of 8 bits of overhead per 64 data bits

In very large systems, the possibility of multiple errors as well as the complete failure of a single memory chip becomes significant
  Chipkill was introduced by IBM to solve this problem
  Similar in nature to the RAID approach used for disks
  Chipkill distributes data and ECC information so that the complete failure of a single memory chip can be handled by reconstructing the missing data from the remaining memory chips
  IBM and Sun servers and Google clusters use it; Intel calls their version SDDC
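As a small illustration of "1 parity bit per 8 bits of data", the sketch below stores an even-parity bit for a byte and uses it to detect (but not correct) a single flipped bit. A SECDED Hamming code instead adds 8 check bits per 64-bit word so that single-bit errors can also be corrected:

/* Even parity per data byte: detects any single-bit error in that byte.
 * (Detection only; correction needs an ECC such as SECDED Hamming.) */
#include <stdint.h>
#include <stdio.h>

static uint8_t parity(uint8_t b) {    /* 1 if b has an odd number of 1 bits */
    b ^= b >> 4; b ^= b >> 2; b ^= b >> 1;
    return b & 1;
}

int main(void) {
    uint8_t data = 0x5A;
    uint8_t stored_parity = parity(data);   /* written alongside the byte */

    data ^= 0x10;                            /* a soft error flips one bit */

    if (parity(data) != stored_parity)
        printf("parity mismatch: single-bit error detected in 0x%02X\n", data);
    return 0;
}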

Looking Forward

Continued slowdown in both density and access time of DRAMs -> new DRAM that does not require a capacitor?
  Z-RAM prototype from Hynix

MRAMs use magnetic storage of data; nonvolatile

PRAMs, phase-change RAMs (aka PCRAM, PCME)
  use a glass that can be switched between amorphous and crystalline states; nonvolatile