The Dynamic Granularity Memory System
Doe Hyun Yoon†
doe-hyun.yoon@hp.com
Min Kyu Jeong‡
mkjeong@utexas.edu
Michael Sullivan‡
mbsullivan@utexas.edu
Mattan Erez‡
mattan.erez@mail.utexas.edu
†Intelligent Infrastructure Lab
Hewlett-Packard Labs
‡Department of Electrical and Computer Engineering
The University of Texas at Austin
Abstract
Chip multiprocessors enable continued performance
scaling with increasingly many cores per chip. As the
throughput of computation outpaces available memory
bandwidth, however, the system bottleneck will shift to main
memory. We present a memory system, the dynamic granularity memory system (DGMS), which avoids unnecessary data transfers, saves power, and improves system performance by dynamically changing between fine- and coarse-grained memory accesses. DGMS predicts memory access granularities dynamically in hardware, and does not require software or OS support. The dynamic operation of DGMS gives it superior ease of implementation and power efficiency relative to prior multi-granularity memory systems, while maintaining comparable levels of system performance.
1. Introduction
With continued device scaling, off-chip memory increasingly becomes a system bottleneck: performance is
constrained as the throughput of computation outpaces
available memory bandwidth [17]; large, high-density
DRAMs and memory traffic contribute significantly to system power [14]; and shrinking feature and growing memory
sizes make reliability a more serious concern [33]. Existing systems attempt to mitigate the impact of the memory
bottleneck by using coarse-grained (CG) memory accesses.
CG accesses reduce miss rates, amortize control for spatially local requests, and enable low-redundancy error tolerance.
© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Figure 1: Number of touched 8B words in a 64B cache line before the line is evicted (stacked bars, 0-100%: 1 word, 2-4 words, 5-7 words, 8 words).
When a program lacks spatial locality, CG accesses waste power, memory bandwidth, and on-chip storage resources. Figure 1 shows the spatial locality of various benchmarks by profiling the number of 8B words accessed in each 64B cache line before the line is evicted. Most applications touch less than 50% of each cache line, and a CG-only memory system wastes off-chip bandwidth and power fetching unused data. A memory system that makes only fine-grained (FG) accesses eliminates this minimum-granularity problem and may achieve higher system throughput than a CG-only memory system. An FG-only memory system, however, incurs high ECC (error checking and correcting) overhead, since every FG data block needs its own ECC. High-end vector processors (e.g., Cray's Black Widow [3]) often use the FG-only approach but squander the benefits of CG accesses when spatial locality is high (e.g., OCEAN, streamcluster, hmmer, and STREAM in Figure 1).
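The profile in Figure 1 amounts to counting, per 64B line, how many distinct 8B words are touched before eviction. A minimal sketch of that measurement (the trace format and names are illustrative, not the authors' tooling):

```python
from collections import defaultdict

LINE = 64   # cache-line size in bytes
WORD = 8    # word size in bytes

def words_touched(accesses):
    """Map each 64B-aligned line to the set of 8B words touched.

    accesses: iterable of byte addresses from a load/store trace.
    """
    touched = defaultdict(set)
    for addr in accesses:
        line = addr // LINE
        word = (addr % LINE) // WORD
        touched[line].add(word)
    return touched

# A strided pattern (stride 32B) touches only 2 of the 8 words per line,
# so a CG-only system would fetch 4x more data than the program uses.
profile = words_touched(range(0, 256, 32))
assert all(len(words) == 2 for words in profile.values())
```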
Previous work presents a memory system with tunable
memory access granularity [41]. The adaptive granularity
memory system (AGMS) enables the processor to selectively use FG accesses only when beneficial and still maintain the efficiency of CG accesses by default. AGMS is
a software-hardware collaborative technique that allows the
programmer or software system to indicate the desired granularity for each memory page. To enable memory error protection, AGMS uses different data / ECC layouts for CG
and FG pages and requires a virtual memory mechanism to
communicate this information to hardware.
We extend AGMS with dynamic mechanisms that offer numerous, substantial benefits. We refer to the resulting system as the Dynamic Granularity Memory System (DGMS). DGMS supports both CG and FG accesses to a single, uniform memory space. Eliminating the strict separation of CG and FG accesses enables true dynamic adaptivity and has the potential to increase the overall utility and to simplify the implementation of an AGMS system.
The data layout for DGMS shares the same memory, including ECC, between CG and FG memory accesses. This allows FG accesses to benefit from the same low-redundancy error tolerance as CG accesses, eliminating the 100% FG ECC overhead required by the original AGMS design [41]. This reduction in error protection overhead improves the capacity, bandwidth, and power efficiency of FG accesses.
Because the layout proposed for DGMS permits pages to service CG and FG accesses simultaneously, it enables the prediction of access granularities without complicated virtual memory mechanisms. Dynamic locality and granularity speculation allows DGMS to operate as a hardware-only solution, without application knowledge, operating system (OS) support, or the need for programmer intervention.
DGMS modifies previously proposed spatial pattern predictors to operate at the main memory interface of a multi-core CPU. This study shows dynamic granularity adjustment to be an effective method for improving performance and system efficiency. DGMS with dynamic spatial locality prediction provides comparable performance to software-controlled AGMS and demonstrates superior DRAM traffic and power reduction capabilities. Overall, DGMS improves average system throughput by 31% and reduces off-chip traffic by 44% and DRAM power by 13%. In addition, DGMS allows the granularity of a cache line to change with program phases, resulting in a more flexible and effective access granularity policy.
The rest of this paper is organized as follows: we briefly review prior work on adaptive granularity in Section 2, present DGMS in Section 3, describe the evaluation methodology in Section 4, evaluate DGMS in Section 5, summarize related work in Section 6, discuss design issues and future work in Section 7, and conclude our study in Section 8.
2. Adaptive Granularity Memory System
Neither CG-only nor FG-only main memory systems are ideal for all applications. A CG-only memory system increases cache hit rates, amortizes control overheads, and benefits from low-redundancy error-control codes for applications with high spatial locality. Many applications, however, exhibit poor spatial locality due to non-unit strides, indexed gather/scatter accesses, and other complex access patterns [29, 27, 34]. For applications with low spatial locality, an FG-only memory system avoids unnecessary data transfers and utilizes off-chip bandwidth more efficiently. However, an FG-only memory system requires high ECC overhead and squanders the benefits of a CG access in programs with high spatial locality.
Figure 2: Sub-ranked memory with register/demux circuitry (nine ×8 DRAM chips, sub-ranks SR0-SR8, behind a register/demux on a 64-bit data + 8-bit ECC DBUS with a shared ABUS).
AGMS is a previously proposed memory system which
combines favorable qualities from both FG and CG accesses. AGMS requires changes to (and collaboration
between) all system levels, from the memory system to
userspace applications. Some implementation details follow; in addition, AGMS requires OS support to track and propagate page granularity information, as well as mixed-granularity memory scheduling at the memory controller. We refer the reader to [41] for more details.
2.1. Application level interface
Enabling memory protection in AGMS requires different
memory protection schemes for different granularities (see
Section 2.4 for details). Consequently, the processor cannot
adapt the granularity of memory without software support.
AGMS allows the programmer or the software system to
dictate the granularity of each page. This information is
communicated through a set of annotations, hints, compiler
options, and defaults that associate a specific access granularity with every virtual memory page or segment.
2.2. Cache hierarchy
AGMS, with its mixed granularity support, needs to
manage both CG and FG data blocks within the cache hierarchy. AGMS uses a sector cache [22]; each 64B cache
line has eight 8B subsectors to manage 8B FG data blocks
within the cache hierarchy. A sector cache does not increase
address tag overhead but adds some storage overheads for
additional valid and dirty bits (14 bits per 64B cache line).
2.3. Main memory
Main memory uses commodity DDRx DRAM devices.
Since most current systems use CG-only memory accesses,
DDRx memory has evolved to enable high data transfer
rates by increasing the minimum access granularity. The
minimum access granularity is the product of burst length
and channel width.
The burst length of a memory access is dictated by
DRAM technology and cannot be changed by system designers. While high density and low cost DRAM designs
limit DRAM operating speeds, effective I/O data rates have
increased throughout DRAM generations. This increase in transfer bandwidth is achieved by employing an n-bit burst access: n is 1 in SDRAM, 2 in DDR, 4 in DDR2, and 8 in DDR3. As a result, the minimum access granularity in a typical 64-bit wide DRAM channel is increasing: 8B in SDRAM, 16B in DDR, 32B in DDR2, and 64B in DDR3.
Figure 3: CG and FG accesses in AGMS [41]. (a) Coarse-grained: Bx represents the x-th byte in a 64B block, and Ey-z is 8-bit SEC-DED ECC for data By to Bz. (b) Fine-grained: Bx represents the x-th byte in a 64B block, and Ex is 8-bit SEC-DED ECC for data Bx.
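The minimum-granularity arithmetic above is just burst length times channel width; a quick check:

```python
# Minimum access granularity = burst length x channel width.
# A typical 64-bit (8B) wide DRAM channel is assumed, as in the text.
CHANNEL_BYTES = 8

burst_length = {"SDRAM": 1, "DDR": 2, "DDR2": 4, "DDR3": 8}
min_granularity = {gen: n * CHANNEL_BYTES for gen, n in burst_length.items()}

assert min_granularity == {"SDRAM": 8, "DDR": 16, "DDR2": 32, "DDR3": 64}
```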
To enable an FG access, AGMS leverages a recently proposed sub-ranked memory system that controls individual
DRAM devices within a rank; data access to/from a single
×8 DRAM device is as small as 8B with a burst-8 access in
DDR3. AGMS uses a sub-ranked memory system similar
to HP’s MC-DIMM (multi-core dual in-line memory module) [4, 5]. Figure 2 illustrates a sub-ranked memory system
with a register/demux that can control individual DRAM
devices (see Section 3.2 for more detail).
2.4. Data layout
AGMS uses different data / ECC layouts for CG and FG
pages. The size of ECC grows sub-linearly with the size of
the data it protects; hence, the finer the access granularity,
the larger the ECC overhead. Typically, a CG data block has
12.5% ECC storage overhead; 8-bit ECC (single bit-error
correct and double bit-error detect, or SEC-DED) for every
64 bits of data. AGMS applies a similar error coding technique to an FG data block (8B), requiring 100% ECC overhead: 5-bit ECC provides SEC-DED protection for the data, but one entire DRAM burst out of a ×8 DRAM chip is needed to access the ECC information.
Figure 3 compares the data layouts for CG and FG pages.
An FG access can achieve high throughput when spatial locality is low but increases the ECC overhead.
The proposed memory system (DGMS) extends AGMS
and allows it to act without OS or programmer support;
adaptive granularity is provided completely in hardware,
without a priori application knowledge or programmer intervention. DGMS uses a unified data / ECC layout that
permits a physical memory location to service both CG and FG accesses simultaneously. This unified data layout enables the prediction of access granularities without complicated virtual memory mechanisms. Locality and granularity speculation, in turn, allow DGMS to operate without external software or programmer support.
Figure 4: A Chip-Multiprocessor (CMP) architecture with DGMS. (Each core has L1 instruction and data caches ($I, $D), an L2 cache, a spatial pattern predictor (SPP), and a local prediction controller (LPC); a shared last-level cache connects to a memory controller with a global prediction controller (GPC) driving the sub-ranked memory.)
3. Dynamic Granularity Memory System
Figure 4 shows a chip-multiprocessor (CMP) architecture with DGMS; each core has a spatial pattern predictor (SPP) and a local prediction controller (LPC). In addition, a global prediction controller (GPC) at the memory controller adaptively tunes the local prediction results.
We describe the ECC scheme and DRAM data layout used
for DGMS in Section 3.1 and then detail its changes to the
AGMS memory system in Section 3.2. The specifics of spatial locality prediction and dynamic granularity adjustment
are described in Section 3.3.
3.1. Data layout
We encode the data within each 64B data chunk differently such that each 8-bit SEC-DED ECC protects the
8B transmitted out of a single DRAM chip over all bursts.
The eight bytes of DGMS ECC protect the full 64B data
chunk with the same redundancy overhead as the conventional CG-only system. Since each 8-bit SEC-DED ECC
protects an independent DRAM chip, the layout supports
both CG and FG accesses. Figure 5(a) illustrates how an
FG request is serviced with the proposed data layout.
Memory traffic with many independent accesses can
negatively impact the performance of the proposed data layout (Figure 5(a)) due to bank conflicts in the ECC DRAM
chip. In order to avoid such contention, we spread ECC
blocks across sub-ranks in a uniform, deterministic fashion,
similar to RAID-5 [28]. We use the residue modulo 9 of the
DRAM column address bits (next to the cache line offset)
for distributing ECC blocks across sub-ranks. We can implement the mod-9 residue generator using efficient parallel designs for moduli of the form 2^a + 1, a ∈ N [37].
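A sketch of this placement rule follows (the exact column-address slicing is our assumption; the text specifies only that the mod-9 residue of the column-address bits next to the cache-line offset selects the ECC location):

```python
SUB_RANKS = 9  # eight 8B data words plus one 8B ECC block per 64B line

def ecc_subrank(col_addr):
    """Sub-rank holding the ECC block for the line at this column address."""
    return col_addr % SUB_RANKS

def data_subranks(col_addr):
    """The remaining eight sub-ranks hold the line's eight 8B data words."""
    e = ecc_subrank(col_addr)
    return [sr for sr in range(SUB_RANKS) if sr != e]

# Consecutive lines rotate their ECC across sub-ranks (RAID-5 style),
# so FG ECC traffic is not concentrated in a single DRAM chip's banks.
assert [ecc_subrank(c) for c in range(10)] == [0, 1, 2, 3, 4, 5, 6, 7, 8, 0]
assert len(data_subranks(3)) == 8 and 3 not in data_subranks(3)
```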
DRAM read: A CG read in DGMS is identical to that
of a conventional, CG-only system; the memory controller
fetches a 72B block including data and ECC. An FG read in DGMS is different: the memory controller accesses two DRAM chips, one for the 8B data and the other for the 8B ECC. Unlike AGMS, this 8B ECC block can detect and correct errors in other data words, which are potentially read later. For this reason, we retain the ECC information of not-yet-fetched words on-chip in the invalid subsectors of each line.
Figure 5: The data layout used by DGMS to support multiple access granularities and the method used to lessen bank conflicts in the ECC DRAM chip. (a) Proposed data layout: Bx represents the x-th byte in a 64B block, and Ey-z is 8-bit SEC-DED ECC for data By to Bz. (b) Spreading ECC locations: ECC blocks rotate across sub-ranks SR0-SR8.
Figure 6(a) illustrates how data words and ECC information are stored in a sectored cache line. In this example,
the memory controller fetches only 3 words (and ECC) from
DRAM, and the invalid subsectors store the ECC of the not-yet-fetched data words. When a subsector miss occurs, the
L2 cache controller sends the cached ECC from the invalid
subsector to the memory controller along with the FG request. Thus, the memory controller fetches only data and
does error checking and correcting as usual, sourcing the
ECC from the cache rather than re-fetching it.
Compared to AGMS (every FG read has an associated
ECC block), DGMS can significantly reduce ECC traffic
when more than one word in a cache line is accessed. Note
that this mechanism does not change the cache coherence
mechanism and does not complicate cache management.
Invalid subsectors simply store ECC information for future
references, reducing ECC traffic.
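The read path above can be sketched with a toy sectored line; the structure is illustrative (real lines carry per-subsector valid/dirty bits in hardware, and `dram_ecc_of` is a hypothetical stand-in for the ECC block read alongside the data):

```python
WORDS = 8  # 8B subsectors per 64B cache line

def dram_ecc_of(w):
    # Hypothetical stand-in for the stored ECC byte of word w.
    return ("ECC", w)

class SectoredLine:
    """Toy sectored cache line: valid subsectors hold data; invalid
    subsectors cache the ECC of their not-yet-fetched words."""
    def __init__(self):
        self.valid = [False] * WORDS
        self.store = [None] * WORDS  # data if valid, cached ECC otherwise

    def fill(self, fetched):
        # fetched: {word index: data} for the words the SPP predicted.
        for w in range(WORDS):
            if w in fetched:
                self.valid[w] = True
                self.store[w] = fetched[w]
            else:
                self.store[w] = dram_ecc_of(w)  # keep ECC for later FG reads

    def subsector_miss(self, w):
        # Ship the cached ECC with the FG request: the memory controller
        # then fetches only the data word, not the ECC block again.
        return {"word": w, "cached_ecc": self.store[w]}

line = SectoredLine()
line.fill({0: "d0", 1: "d1", 4: "d4"})   # only 3 predicted words fetched
req = line.subsector_miss(2)
assert req["cached_ecc"] == ("ECC", 2)   # ECC sourced from cache, no refetch
```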
DRAM write: An FG DRAM write-back updates an 8B
data block as well as an 8B ECC block. The memory controller must update the ECC with new information corresponding to the words being written, but should not change
the ECC information that corresponds to invalid subsectors
in the cache line being written back. Figure 6(b) shows how
ECC for valid words is encoded, combined with the cached
ECC of invalid subsectors, and written out to DRAM. If
a dirty write-back has only a few dirty words, but the local or global prediction control dictates a CG access (discussed in Section 3.3.2), then the memory controller uses write masks to avoid overwriting unchanged or unfetched data words in DRAM.
Figure 6: DRAM read/write examples in DGMS. (a) Read: fetched data and its ECC pass through the ECC decoder, and the ECC of not-yet-fetched words is stored in the invalid subsectors of the cache line. (b) Write: ECC for valid words is re-encoded, merged with the cached ECC of invalid subsectors, and written out through the register/demux to the sub-ranked DRAM.
3.2. Main memory
We use a sub-ranked memory system similar to MC-DIMM [4, 5] to enable FG memory accesses. This sub-ranked memory system places a register/demux to control each DRAM chip independently, providing 8B access granularity with DDR3 burst-8 accesses.
In order to maximize data bus (DBUS) utilization with
FG requests, both AGMS and DGMS use double data rate
signaling for increased address bus (ABUS) bandwidth.
Figure 7(a) illustrates the partitioned register/demux presented in AGMS [41], which statically separates sub-ranks
into multiple partitions. This partitioned register/demux
ABUS architecture works well for AGMS; an FG access,
fetching 8B data and 8B ECC, is served by one command to
two neighboring sub-ranks in the same static partition, and
the memory controller can issue two independent accesses
per cycle, one for each partition.
DGMS uses the unified data / ECC layout presented
in Section 3.1; an FG access can be served by any combination of two sub-ranks since ECC data can now be in any
sub-rank as in Figure 5(b). Such a layout is not a good fit
for the partitioned register/demux system used by AGMS.
Figure 7: Sub-ranked DRAM (2× ABUS) with register/demux circuitry. (a) Partitioned reg/demux: sub-ranks SR0-SR8 are statically split into Partition 0 and Partition 1, each with its own double-data-rate ABUS. (b) Unconstrained reg/demux: a single double-data-rate ABUS reaches every sub-rank.
When DGMS data and its ECC fall in different partitions, the memory controller must issue two separate commands for one FG request, doubling ABUS bandwidth consumption. To mitigate the inefficiency of the partitioned register/demux and to simplify scheduling, we use an unconstrained register/demux architecture, shown in Figure 7(b). This architecture is able to dispatch any two commands to disjoint sub-ranks each cycle.
3.3. Dynamic granularity adjustment
The data layout described in Section 3.1 allows DGMS
to eliminate the strict separation between CG and FG pages.
This removes the need for virtual memory support for memory access granularity, making DGMS a hardware-only solution. Adjusting access granularity without software support significantly reduces the barrier to adopting DGMS in
actual designs.
We use a previously suggested hardware predictor that
identifies likely-to-be-referenced words within a cache
line [19, 10]. Since the prior spatial pattern predictors are
designed for a single core, we introduce a two-level prediction control mechanism, considering the potential interference among multiple cores and threads: a local prediction controller (LPC) in each core and a global prediction
controller (GPC) at the memory controller. Section 3.3.1
describes the details of spatial pattern prediction, and Section 3.3.2 illustrates the proposed two-level prediction control mechanism.
3.3.1. Spatial pattern predictor We use the spatial pattern
predictor (SPP) proposed by Chen et al. [10]. The SPP uses
a current pattern table (CPT) and a pattern history table
(PHT) for predicting likely-to-be-referenced word patterns
upon a cache miss. Figure 8 illustrates the organization of
an L1 data cache with the CPT and PHT.
Current pattern table: The CPT keeps track of which
words in each L1 cache line are referenced. A CPT entry
is composed of a bit vector, with a one indicating that the
corresponding word in the cache line was used (Used), and
an index into the pattern history table (Idx). The Used bit vector is updated on every L1 data cache access and tracks all words used in the cache line over its lifetime (from cache fill to eviction).
Figure 8: SPP [10] and LPC.
When an L1 cache line is evicted, the associated CPT entry updates the PHT with the Used bit vector to enable prediction of future usage patterns. The Idx indicates the PHT
entry to be updated. We construct the Idx using the program
counter (PC) and the data address (DA) of the load/store instruction that originally triggered the cache fill. We use a
12-bit PHT Idx and calculate an Idx as follows:
Idx = 0xFFF & ((((PC >> 12) ⊕ PC) << 3) + (0x7 & (DA >> 3)))
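In code, the index computation reads as follows (⊕ is XOR; grouping the shift before the addition is our reading of the formula):

```python
def pht_index(pc, da):
    """12-bit PHT index: mostly PC bits, plus 3 data-address bits
    (the 8B-word offset within the 64B line) to separate alignments."""
    return 0xFFF & ((((pc >> 12) ^ pc) << 3) + (0x7 & (da >> 3)))

# The same instruction with a different word alignment maps to a
# different PHT entry; the index always fits in 12 bits.
assert pht_index(0x401A2C, 0x7F00) != pht_index(0x401A2C, 0x7F08)
assert 0 <= pht_index(0xDEADBEEF, 0xFFFFFFFF) <= 0xFFF
```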
Pattern history table: The PHT is a cache-like structure
that maintains recently captured spatial locality information. Although Figure 8 describes the PHT as a direct-mapped structure, it can be of any associativity. We use a
small, 32-set 8-way set associative PHT (only 768B) in the
evaluation. This small PHT is sufficient because the PHT
tracks the pattern behavior of load/store instructions and
does not attempt to track the large number of cache lines
in the system. The PHT Idx, as shown above, is composed
mostly of PC bits with a few DA bits to account for different alignments (as discussed in [10]). The 12-bit Idx we use
can track 512 different memory instructions (assuming no
aliasing); this is sufficient for the applications we evaluate,
corroborating prior results [10, 19].
When a cache miss occurs, the PHT is queried to get the
predicted spatial pattern. If a PHT miss occurs, a default
prediction is used. A strong default is important for DGMS;
we propose a heuristic based on per-thread spatial locality.
If the average number of referenced words per line is fewer
than 3.75, the immediately requested words are used as the
default prediction. Otherwise, the predictor defaults to a
coarse-grained prediction. This heuristic is based on the observation that fetching approximately 4 or more FG words
is often inefficient due to high control overheads.
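On a PHT miss, the default prediction reduces to one threshold test; a sketch (the 3.75-word threshold is from the text, and representing a pattern as a bit vector with bit i set for word i mirrors the Used vector; fetching exactly one demanded word is a simplification, since the text says "requested words"):

```python
WORDS = 8

def default_pattern(avg_ref_words, demanded_word):
    """Fallback prediction on a PHT miss: fetch only the demanded word
    for low-locality threads, the whole line otherwise."""
    if avg_ref_words < 3.75:
        return 1 << demanded_word   # FG: just the requested 8B word
    return (1 << WORDS) - 1         # CG: all eight words (0b11111111)

assert default_pattern(2.1, 3) == 0b00001000
assert default_pattern(6.0, 3) == 0b11111111
```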
3.3.2. Local and global prediction control The SPP effectively predicts potentially referenced words, thereby minimizing off-chip traffic.
Algorithm 1 Calculating row-buffer hit rate. addr is the address of a request from L2.
  Accesses = Accesses + 1
  bk = get_bank_addr(addr)
  row = get_row_addr(addr)
  if row ≠ row_buffer_status[bk] then
    if queue[bk] is full then
      BankConflicts = BankConflicts + 1
      oldest_row = get_row_addr(oldest addr in queue[bk])
      remove all entries with row address equal to oldest_row from queue[bk]
      row_buffer_status[bk] = oldest_row
      if oldest_row ≠ row then
        push addr into queue[bk]
      end if
    else
      push addr into queue[bk]
    end if
  end if
  Page_Hit_Rate = 1 - (BankConflicts / Accesses)
The goal of DGMS, however,
is to maximize system throughput and power efficiency by
predicting spatial locality in DRAM access streams. As
discussed in AGMS [41], FG memory accesses increase
DRAM control overhead; an overabundance of FG requests
is undesirable even if it reduces the total data traffic. Thus,
we employ a two-level prediction control mechanism that
combines local prediction with global adjustment.
Local prediction controller: The LPC in each core monitors thread access patterns and determines ModeLPC, which
is based upon two metrics: the average number of referenced words per cache line and the row-buffer hit rate (per
thread). The former represents spatial locality within a
cache line, and the latter measures spatial locality across
cache lines. If the average number of referenced words exceeds 3.75 or if the row-buffer hit rate is greater than 0.8,
ModeLPC is set to CG; otherwise, it is set to Transparent.
The spatial pattern predicted by the SPP is ignored if
ModeLPC is CG, but we defer the actual decision to the GPC
at the memory controller to take into account memory requests across all the cores. Thus, ModeLPC is attached to
every request from L1 (both reads and writes) as in Figure 8.
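The LPC decision reduces to two threshold tests (both values are from the text):

```python
def mode_lpc(avg_ref_words, row_buffer_hit_rate):
    """Per-thread LPC decision attached to every request from L1:
    CG when locality is high within lines (> 3.75 words referenced
    on average) or across lines (row-buffer hit rate > 0.8)."""
    if avg_ref_words > 3.75 or row_buffer_hit_rate > 0.8:
        return "CG"
    return "Transparent"  # defer to the SPP pattern and the GPC

assert mode_lpc(7.2, 0.3) == "CG"           # streaming-like thread
assert mode_lpc(1.9, 0.2) == "Transparent"  # sparse, low-locality thread
```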
We measure per-thread row-buffer hit rate by observing
traffic just below the last core-private cache (L2 in our case);
after this point in the memory hierarchy, requests from different cores are interleaved, making per-thread observation
difficult. We analyze the row-buffer hit rate of each L2 miss
or eviction using a simple DRAM row-buffer model that
manages a 4-entry scheduling queue per bank (assuming 32
memory banks, a 4kB row-buffer per bank, and FR-FCFS
scheduling [30]). Note that this model does not include
timing and only counts the number of requests and bankconflict requests. Algorithm 1 shows how we count bank
conflicts and estimate the row-buffer hit rate.
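Algorithm 1 can be transcribed almost directly; this is a sketch in which the address-mapping functions are placeholders (the paper's XOR-based bank/sub-rank mapping [42] differs):

```python
from collections import deque

BANKS = 32        # 32 memory banks, 4kB row buffer each
QUEUE_DEPTH = 4   # 4-entry scheduling queue per bank

queues = {bk: deque() for bk in range(BANKS)}
row_buffer_status = {bk: None for bk in range(BANKS)}
accesses = 0
bank_conflicts = 0

def get_bank_addr(addr):
    return (addr >> 12) % BANKS   # placeholder bank mapping

def get_row_addr(addr):
    return addr >> 12             # 4kB row buffer => drop 12 offset bits

def observe(addr):
    """Algorithm 1: count an L2 miss/eviction against the model."""
    global accesses, bank_conflicts
    accesses += 1
    bk, row = get_bank_addr(addr), get_row_addr(addr)
    if row != row_buffer_status[bk]:
        if len(queues[bk]) >= QUEUE_DEPTH:
            bank_conflicts += 1
            oldest_row = get_row_addr(queues[bk][0])
            # FR-FCFS would drain the oldest row next: retire its entries.
            queues[bk] = deque(a for a in queues[bk]
                               if get_row_addr(a) != oldest_row)
            row_buffer_status[bk] = oldest_row
            if oldest_row != row:
                queues[bk].append(addr)
        else:
            queues[bk].append(addr)

def page_hit_rate():
    return 1 - bank_conflicts / accesses

# Five requests to five different rows of the same bank overflow the
# 4-entry queue exactly once: hit rate = 1 - 1/5 = 0.8.
for a in [0x0, 0x20000, 0x40000, 0x60000, 0x80000]:
    observe(a)
assert abs(page_hit_rate() - 0.8) < 1e-12
```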
Global prediction controller: The GPC at the memory
controller dynamically adjusts the access granularity based
on the memory controller status, SPP predictions, and LPC
decisions (ModeLPC). Figure 9 illustrates the GPC decision logic. When one type of request (CG or FG) dominates the memory controller queue, the GPC forces incoming transactions to the dominating mode (CG or FG), ignoring the SPP. If neither CG nor FG requests dominate the memory controller queue, the memory controller follows the decision made by the LPC and the SPP. This global override is important to maximize memory throughput rather than just minimize memory traffic.
Figure 9: Global prediction decision logic (a decision tree over FRACCG > 0.8, FRACCG > 0.6, and ModeLPC == CG, with outcomes CG, FG, and Transparent). FRACCG is the fraction of coarse-grained requests in the memory controller queue, and ModeLPC is the LPC's decision bundled with the request.
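The override can be sketched as follows. The 0.8 threshold appears in Figure 9; the threshold for an FG-dominated queue is our assumption, since the figure's branch structure is not fully recoverable from the text:

```python
def gpc_mode(frac_cg, mode_lpc):
    """Global override at the memory controller. frac_cg is the fraction
    of CG requests in the memory-controller queue (FRACCG)."""
    if frac_cg > 0.8:          # CG requests dominate: force CG mode
        return "CG"
    if frac_cg < 0.2:          # FG requests dominate: force FG (assumed)
        return "FG"
    # Neither dominates: honor the per-thread LPC decision / SPP pattern.
    return "CG" if mode_lpc == "CG" else "Transparent"

assert gpc_mode(0.9, "Transparent") == "CG"   # queue override wins
assert gpc_mode(0.5, "Transparent") == "Transparent"
```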
4. Evaluation Methodology
To evaluate DGMS, we use detailed cycle-based simulation. We integrate the Zesto simulator [23] with DrSim [18],
a detailed DRAM model. This simulation platform supports
all aspects of DGMS, including the sub-ranked memory
systems as well as the register/demux circuitry described
in Section 3.2.
Workloads: We use a mix of several applications
from SPEC CPU2006 [35], PARSEC [7], Olden [9],
SPLASH2 [39], and the HPCS [1] benchmark suites as well
as the GUPS [12] and STREAM [24] microbenchmarks. Our
collection of benchmarks is primarily memory intensive but
also includes some compute-bound applications. Table 1
summarizes the characteristics of the benchmarks. We use 8
identical instances of single-threaded applications to stress
memory systems in a CMP and also run the application
mixes described in Table 2.
We extract a representative region of 100 million instructions from each application for the cycle-based simulations.
We use Simpoint [16] with the SPEC applications and manually skip the initialization phase for the regularly-behaved
applications (Olden, PARSEC, SPLASH2, HPCS, GUPS
and STREAM).
System configurations: Table 3 describes the base system
configuration used for the cycle-based simulations. Note
that we use a system with relatively low off-chip bandwidth
to evaluate DGMS in the context of future systems, where
off-chip bandwidth is likely to be scarce.
Power models: Our main focus is on the memory hierarchy. We use the detailed power model developed by the Micron Corporation [2] for DRAM, and CACTI 6 [26] for the cache hierarchy. Our processor power analysis uses the IPC-based model suggested by Ahn et al. [4]. In this model, the maximum power per core is estimated to be 16.8W based on a 32nm Xeon processor model using McPAT v0.7 [20]; half of the maximum power is assumed to be fixed (including leakage), and the other half is proportional to IPC. To account for the additional overhead of sector caches, register/demux circuitry, and ECC logic, we add a conservative 10% power penalty to the LLC and DRAM power in AGMS and DGMS. We do not add additional power for the SPP since it is a very small structure – only 768B per core.
Table 1: Benchmark characteristics.
Benchmark suite | Application | LLC MPKI | DRAM page hit rate (%) | Avg. words per cache line | Average granularity | DRAM traffic
SPEC CPU2006 | mcf | 31.3 | 19.1 | 3.59 | MEDIUM | HIGH
SPEC CPU2006 | omnetpp | 11.6 | 47.8 | 3.22 | MEDIUM | HIGH
SPEC CPU2006 | bzip2 | 3.2 | 57.1 | 3.63 | MEDIUM | LOW
SPEC CPU2006 | hmmer | 0.87 | 91.3 | 7.93 | COARSE | LOW
SPEC CPU2006 | lbm | 22.9 | 82.6 | 3.92 | MEDIUM | HIGH
PARSEC | canneal | 17.2 | 14.1 | 1.87 | FINE | HIGH
PARSEC | streamcluster | 14.5 | 86.8 | 7.24 | COARSE | HIGH
SPLASH2 | OCEAN | 18.6 | 92.6 | 6.68 | COARSE | HIGH
Olden | mst | 41.6 | 40.5 | 2.30 | FINE | HIGH
Olden | em3d | 39.4 | 27.4 | 2.62 | FINE | HIGH
HPCS | SSCA2 | 25.4 | 25.5 | 2.63 | FINE | HIGH
Microbenchmarks | GUPS | 174.9 | 10.9 | 1.84 | FINE | HIGH
Microbenchmarks | STREAM | 51.9 | 96.5 | 7.99 | COARSE | HIGH
Table 2: Application mix for 8-core simulations.
MIX1: SSCA2 ×2, mst ×2, em3d ×2, canneal ×2
MIX2: SSCA2 ×2, canneal ×2, mcf ×2, OCEAN ×2
MIX3: canneal ×2, mcf ×2, bzip2 ×2, hmmer ×2
MIX4: mcf ×4, omnetpp ×4
MIX5: SSCA2 ×2, canneal ×2, mcf ×2, streamcluster ×2
Table 3: Simulated base system parameters.
Processor core: 4GHz x86 out-of-order core (8 cores)
L1 I-caches: 32kB private, 2-cycle latency, 64B cache line
L1 D-caches: 32kB private, 2-cycle latency, 64B cache line
L2 caches: 256kB private for instruction and data, 7-cycle latency, 64B cache line
Last-Level cache (LLC): 8MB shared cache, 17-cycle latency, 64B cache line
On-chip memory controller: FR-FCFS scheduler [30], 64-entry read queue, 64-entry write queue, XOR-based bank and sub-rank mapping [42]
Main memory: one 72-bit wide DDR3-1066 channel, 64-bit data and 8-bit ECC, ×8 DRAM chips, 8 banks per rank, 4 ranks per channel, parameters from Micron 1Gb DRAM [25]
Metrics: We use the weighted speedup (WS) [13] to measure system throughput with multiprogrammed workloads, as shown in Equation 1: N is the number of cores, IPC_i^shared is the IPC of the i-th application when running with other applications, and IPC_i^alone is the IPC of the i-th application when running alone in the CMP.
WS = Σ_{i=0}^{N-1} IPC_i^shared / IPC_i^alone    (1)
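Equation 1 translates directly to code:

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Weighted speedup (Equation 1): sum over the N co-scheduled
    applications of IPC_i^shared / IPC_i^alone."""
    assert len(ipc_shared) == len(ipc_alone)
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

# Two of four co-scheduled apps run at half speed, two are unaffected:
# WS = 0.5 + 0.5 + 1.0 + 1.0 = 3.0 (out of an ideal 4.0).
assert weighted_speedup([0.5, 0.5, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]) == 3.0
```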
We also report the system power efficiency in terms of
throughput (WS) per Watt. System power includes the aggregate power of cores, caches, and DRAM. Power efficiency, rather than energy efficiency, is appropriate for this
study because of our multiprogrammed simulation methodology. While we collect statistics such as IPC for a fixed number of instructions from each program, the amount of time over which statistics are gathered varies to ensure fair contention (additional details in the AGMS paper [41]).
Figure 10: PHT hit rate and prediction accuracy in the SPP (per-benchmark stacked bars of "Predicted & Referenced", "Predicted, but Not Referenced", and "Not Predicted, but Referenced" words, with the PHT hit rate overlaid).
5. Results and Discussion
In this section, we evaluate DGMS. We first discuss the accuracy of the spatial pattern predictor in Section 5.1
and then investigate the effectiveness of local and global
prediction in Section 5.2. We present the performance and
power impacts of DGMS in Section 5.3.
5.1. Spatial pattern predictor accuracy
To measure the accuracy of the SPP, we run simulations without local and global prediction. Figure 10 shows the PHT hit rate and prediction accuracy. In most applications (except omnetpp and streamcluster), the PHT hit rate is high and spatial prediction is very accurate, exhibiting a high percentage of "Predicted & Referenced" accesses and relatively few "Predicted, but Not Referenced" and "Not predicted, but Referenced" accesses.
To better explore the SPP design space, we run another simulation with a larger PHT (64 sets, 32-way set associative); the results are almost the same as in Figure 10, except for omnetpp, where the larger PHT increases the PHT hit rate from 65% to 81%. The overall performance improvement from the larger PHT is marginal, however, so we do not use it in this study. Another notable application is streamcluster, which suffers from many "Not predicted, but Referenced" accesses. This low prediction accuracy results in significant performance degradation when spatial prediction is used alone.
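As a rough illustration of the predictor being evaluated here, a minimal PHT-style spatial pattern predictor in the spirit of [10, 19] can be sketched as follows. The dictionary-based table, the indexing key, and the 8-word (64B line, 8B word) granularity are illustrative assumptions, not the exact DGMS configuration:

```python
# A minimal sketch of a pattern-history-table (PHT) spatial pattern
# predictor. Table organization and indexing are simplified assumptions.

WORDS_PER_LINE = 8  # 64B cache line, 8B words (assumed)

class SpatialPatternPredictor:
    def __init__(self):
        self.pht = {}  # key -> bitmask of words referenced last time

    def predict(self, key, first_word):
        """On a miss: a PHT hit returns the recorded footprint (plus the
        critical word); a PHT miss falls back to fetching the whole line."""
        if key in self.pht:
            return self.pht[key] | (1 << first_word)   # fine/medium grain
        return (1 << WORDS_PER_LINE) - 1               # coarse grain

    def train(self, key, referenced_mask):
        """On eviction: record which words were actually touched."""
        self.pht[key] = referenced_mask

spp = SpatialPatternPredictor()
spp.train(key=0x40, referenced_mask=0b00000101)  # words 0 and 2 were used
print(bin(spp.predict(key=0x40, first_word=0)))  # 0b101 -> fetch 2 words
print(bin(spp.predict(key=0x80, first_word=3)))  # PHT miss -> 0b11111111
```

"Predicted, but Not Referenced" and "Not predicted, but Referenced" accesses in Figure 10 correspond to bits set (resp. clear) in the predicted mask that the program then does not (resp. does) touch.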
Figure 11: Effects of SPP, LPC, and GPC. [figure; weighted speedup of CG, SPP-only, Local-control (HitRate), Local-control (AvgRefWords), Local + WB control, and Global-control for each benchmark and MIX]

Figure 12: System throughput. Results based on partitioned register/demux. Stacked black bars represent additional gain due to unconstrained register/demux. [figure; weighted speedup of CG, AGMS, DGMS-profiling, and DGMS-prediction for each benchmark and MIX]
As such, streamcluster illustrates the importance of
using a combination of control mechanisms to achieve robust performance gains across different workloads.
5.2. Effects of local and global control prediction
Figure 11 presents the effects of local and global prediction: CG is the CG-only baseline; SPP-only is DGMS with only spatial prediction (no local or global predictors); Local-control (HitRate) is DGMS with local prediction control based on hit-rate decisions only; Local-control (AvgRefWords) bases local control decisions on the average number of referenced words; Local + WB control uses full local prediction control (based on both hit rate and average referenced words) and also controls write-backs; and Global-control uses the combined local and global prediction control.
In most applications, SPP-only works well, and local/global prediction control does not significantly alter performance. However, SPP-only degrades performance in applications with high spatial locality (lbm, OCEAN, and streamcluster). While hit-rate based local control works well for lbm, it is not sufficient for OCEAN and streamcluster, where applying local prediction control to write-backs is very effective. Experiments using only hit-rate based local prediction combined with write-back control fail to achieve performance comparable to the CG baseline; the row-buffer hit rate of streamcluster (0.7) is below the threshold (0.8) in the LPC, so some local prediction based on the number of referenced words is also needed.
MIX2 is an interesting case; it is negatively impacted by the most sophisticated local prediction (Local + WB control). In MIX2, only OCEAN has high spatial locality, while the other (more memory-intensive) applications have low spatial locality. OCEAN generates CG requests, which burden the memory controller (even though CG is the right decision for OCEAN in isolation). Though the memory controller can split a CG request into FG requests, this is not as effective as serving a single granularity when possible. The global prediction control detects and corrects such granularity inefficiencies by monitoring the queue status at the memory controller and disabling the LPC's decisions to achieve better performance.
5.3. Performance and power impacts
Figure 12 compares the system throughput of the CG
baseline, AGMS [41], DGMS-profiling (DGMS with the
same static granularity decision as in AGMS), and DGMSprediction (DGMS with spatial pattern prediction and local/global prediction control).
Effects of register/demux configuration: We use the partitioned register/demux in both AGMS and DGMS; the stacked black bars represent the additional gains possible with an unconstrained register/demux. In AGMS, an FG request accesses two DRAM chips (one for data and the other for ECC), both in the same partition; hence, the partitioned register/demux performs as effectively as the unconstrained one. With the DGMS data layout, however, the benefits of the unconstrained register/demux are apparent. It
provides high effective ABUS bandwidth and has the greatest impact on applications that have high ABUS utilization, achieving a throughput improvement of 16–24% for
SSCA2, em3d, GUPS, omnetpp, MIX2, and MIX5.
Low spatial locality applications: Applications such as
SSCA2, canneal, em3d, mst, and GUPS have very low
spatial locality and typically only access one or two words
per cache line. As a result, adaptive granularity significantly improves system throughput: AGMS by 20–220%, DGMS-profiling by 20–180%, and DGMS-prediction by 18–180%.

Figure 13: Off-chip traffic. AGMS with the partitioned register/demux and DGMS with the unconstrained register/demux. [figure; traffic in Bytes/Instr, broken into CG Data, CG ECC, FG Data, and FG ECC, for CG, AGMS, and DGMS on each benchmark and MIX]

Figure 14: DRAM power. [figure; DRAM power in W, broken into Background, Refresh, ACT/PRE, RD/WR, I/O, and Reg/Demux, for CG, AGMS, and DGMS on each benchmark and MIX]

The reason why AGMS consistently outperforms DGMS is that AGMS exhibits more regular access
patterns. FG requests in AGMS are aligned in neighboring sub-ranks, whereas the unified data / ECC layout randomizes the ECC for FG blocks in DGMS. As a result, the
bank conflict rate increases significantly with the DGMS
data layout. For example, the DRAM row-buffer hit rate
of SSCA2 is 10% with AGMS but drops to almost 4% with
DGMS-profiling, although both configurations use the same
profile data for granularity decisions.
Effects of new data/ECC layout: The new layout of
DGMS has advantages over that of AGMS and can significantly reduce ECC traffic. DGMS makes a single ECC access for all subsectors of a cache line, while AGMS requires a separate ECC access for each sector. The benefits of fewer
ECC accesses are very apparent when considering DRAM
traffic (described later in this subsection).
It is hard to isolate the throughput gain of fetching less
ECC from the degradation due to increased DGMS bank
conflicts. The results of GUPS and mcf, however, provide some useful insights in this area. GUPS accesses 1
word per cache line, so DGMS cannot take advantage of
reduced ECC traffic. As such, the performance degradation of DGMS (relative to AGMS) is mainly due to the increased bank conflicts from its memory layout. In contrast
to GUPS, mcf significantly benefits from DGMS, outperforming AGMS by 30%. The mcf application accesses an
average of 3.6 words per cache line, such that the new data
layout of DGMS significantly reduces ECC traffic.
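Under the simplifying assumption that AGMS issues one ECC access per referenced sector while DGMS fetches a single ECC word per cache line, the ECC-traffic difference reduces to simple arithmetic (this is an illustration, not the full layout model):

```python
# Back-of-the-envelope ECC accesses per cache-line access, assuming one
# ECC transfer per referenced sector in AGMS vs. a single ECC word per
# line in DGMS. A simplification of both layouts, for illustration only.

def ecc_accesses(words_referenced, scheme):
    if scheme == "AGMS":
        return words_referenced   # one ECC access per referenced sector
    if scheme == "DGMS":
        return 1                  # one ECC word covers all subsectors
    raise ValueError(scheme)

# GUPS touches ~1 word per line, mcf ~3.6 on average (rounded to 4 here).
for app, words in [("GUPS", 1), ("mcf", 4)]:
    print(app, ecc_accesses(words, "AGMS"), ecc_accesses(words, "DGMS"))
```

This matches the observation above: with one referenced word, DGMS gains nothing over AGMS on ECC traffic, while multi-word accesses amortize the single DGMS ECC fetch.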
High spatial locality applications: Applications that have
high spatial locality, such as libquantum, OCEAN,
streamcluster (s-cluster in the graph), and STREAM,
do not benefit much from adaptive granularity. The profiler marks nearly all pages as CG in AGMS and DGMS-
profiling. In DGMS-prediction, the GPC (global prediction controller) forces CG accesses almost exclusively.
One interesting case is lbm, which accesses 2–3 words
per cache line. FG accesses effectively reduce off-chip traffic, as expected. However, lbm’s memory access streams
show very high row-buffer hit rates, and simply using CG
requests (chosen through local prediction control) yields
better performance. However, with 4× ABUS bandwidth
(twice the address bandwidth of the chosen configuration),
lbm without local/global prediction control results in 5%
higher performance than CG.
Off-chip traffic and DRAM power: Figure 13 compares
the off-chip traffic of the CG baseline with AGMS and
DGMS-prediction. While both AGMS and DGMS reduce
off-chip traffic (36% lower in AGMS and 44% lower in
DGMS), DGMS shows consistently lower traffic than that
of AGMS with one exception. This is due to the data layout
of DGMS, which allows only 1 ECC word for a 64B cache
line regardless of how many words are referenced. Hence,
DGMS can reduce ECC traffic when more than 1 word is
accessed, as with mcf and omnetpp. The SPP over-fetches in em3d, yielding slightly more traffic than AGMS but still radically less than the CG baseline.
DGMS also reduces DRAM power as shown in Figure 14. Compared to the CG baseline, AGMS reduces
DRAM power by 3% and DGMS by 13%, on average.
In applications with high spatial locality (libquantum,
OCEAN, streamcluster, and STREAM), AGMS and
DGMS use 10% higher DRAM power than the CG baseline due to the register/demux. Note that the 10% penalty
for the register/demux is a very conservative estimate, and
the CG baseline will have a similar penalty when registered
DIMMs or Buffer-on-Boards are used.
Figure 15: Power efficiency. [figure; throughput/power of CG, AGMS, and DGMS, normalized to CG, for each benchmark and MIX]

Figure 16: System throughput of AGMS and DGMS with and without ECC. [figure; weighted speedup of CG, AGMS (ECC), DGMS-prediction (ECC), DGMS-profiling (No ECC), and DGMS-prediction (No ECC) for each benchmark and MIX]

Power efficiency: Figure 15 shows the normalized throughput per unit power. We measure the whole-system power, including cores, caches, and DRAM, for estimating power efficiency. Though DGMS reduces DRAM power consumption by 13% on average, the system power is dominated
by the processor cores: 8 cores consume 69–72W out of around 80W of total system power. Therefore, the system power efficiency is strongly correlated with the system throughput. DGMS improves power efficiency over the CG baseline by 30% on average, and by factors of nearly 2 and 3 for canneal and GUPS, respectively.
DGMS without ECC: We also evaluate DGMS without
ECC. When ECC is disabled, DGMS can further improve
system throughput since it does not suffer from ECC rowbuffer interference and bank conflicts. Figure 16 presents
the system throughput of CG, AGMS (with ECC), DGMSprediction (with ECC), DGMS-profiling (without ECC),
and DGMS-prediction (without ECC). Note that DGMSprofiling and AGMS are the exact same design in a system
without ECC.
Without ECC support, both DGMS-profiling and
DGMS-prediction outperform DGMS with ECC. Furthermore, dynamic locality prediction (DGMS-prediction)
garners additional gains relative to the static profiling-based DGMS-profiling and AGMS (canneal, omnetpp,
MIX1, and MIX2). MIX3, for which DGMS with ECC
performs worse than CG, is now improved by 29%. Overall, DGMS-prediction without ECC provides an additional
gain of 22% compared to DGMS-prediction with ECC and
improves system throughput by 55% over the CG baseline.
6. Related Work
Adaptive granularity: DGMS is based on prior work,
AGMS [41], and shares many features in common with
AGMS. DGMS uses a unified data/ECC layout to allow
multi-granularity memory accesses to the same memory
space, obviate software support, and enable dynamic granularity adaptation. DGMS is a hardware-only solution which
retains the main advantages of AGMS while simultaneously
reducing implementation difficulties.
DRAM systems: The idea of sub-ranked memory is
described in many recent proposals, including Rambus’s threaded-module [38], mini-ranks [43], HP’s MCDIMM [5, 4], and Convey’s S/G DIMM [8]. Most of these
approaches focus on reducing the energy of CG accesses.
S/G DIMM [8] is designed for FG accesses, but no detailed
quantitative analysis is provided.
Caches: We evaluate DGMS with sector caches [22] to
manage both CG and FG data in the cache hierarchy.
A more advanced architecture, such as a decoupled sectored cache [32], a pool-of-sectors cache [31], or a spatio/temporal cache [15], can better manage FG data in the
cache hierarchy. The simple sector cache is used because
it enables a fair comparison among DGMS, AGMS, and
a conventional CG-only memory system and isolates improvements to the memory interface.
Spatial locality prediction: We use the prior designs of
spatial footprint prediction [19] and spatial pattern prediction [10]. We adapt these ideas to the main memory interface
and introduce adaptive local and global overriding of spatial
locality prediction to match the needs of multigranularity
memory access scheduling in modern DRAM systems.
7. Caveats and Future Work
While the alternative data layout proposed for DGMS
has substantive, practical advantages, its adoption complicates two possible DRAM system optimizations: DRAM
critical word first and single-pin failure protection with
SEC-DED ECC.
With the proposed new layout, it is no longer possible to
access the critical word first at the DRAM boundary. ECC
information can only be checked after an entire burst has
been received, rather than after each DRAM beat, which is possible with conventional mapping. We simulated the SPEC CPU2006 benchmarks on a 4-core CMP with and without critical word first support. The results show that DRAM critical word first improves system throughput by less than 1% in all simulated cases.

Figure 17: A simple erasure code that provides chipkill-level protection for DGMS. A 7-bit CRC provides error detection for each 16B data block. When an error occurs, the CRC locates the erroneous sub-rank and horizontal parity corrects the error. The remaining bits in the 16B CRC block can be used for error detection in the CRC chip itself and/or in the parity chip. One caveat is that a write back requires a read-modify-write operation to correctly update the parity information. [diagram; 16B data blocks, a CRC chip, and a parity chip across a burst of 16]
The second implementation issue is that the proposed
layout cannot tolerate a single pin failure, which is possible
with the conventional layout. A single pin failure corrupts
multiple (up to 8) bits within an FG data block, whereas the
commonly used SEC-DED ECC can only correct a single
bit failure. In the conventional design, a pin failure manifests as a single bit failure in every beat and can be corrected
by SEC-DED ECC.
Tolerating a pin failure, however, is not the primary goal
of a SEC-DED system, which is designed for soft errors.
For strong reliability guarantees against permanent failures,
some variant of chipkill-correct is typically used [11]. We
sketch a possible chipkill-correct configuration with DGMS
and present it in Figure 17. Note that the minimum access
granularity increases to 16B, but overall redundancy level
is unchanged. Maintaining both chipkill-correct protection
level and 8B access granularity requires either increasing
the redundancy level or employing techniques such as Virtualized ECC [40], which decouples ECC information from
data storage. While further work remains to investigate alternative error protection schemes with DGMS, levels of error protection stronger than SEC-DED are clearly feasible.
A detailed evaluation of such designs is beyond the scope
of this paper.
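The erasure-code idea sketched in Figure 17 (detect with a per-block check code, locate the failed sub-rank, reconstruct via horizontal parity) can be illustrated in a few lines. Here zlib.crc32 stands in for the paper's 7-bit CRC, and the three-block layout is a simplification of the actual sub-rank organization:

```python
# Erasure-code demonstration: per-block CRCs locate a corrupted block
# (erasure), and XOR parity across blocks reconstructs it. zlib.crc32
# stands in for the 7-bit CRC of Figure 17; sizes are illustrative.
import zlib
from functools import reduce

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(blocks):
    """Compute per-block CRCs and the horizontal XOR parity block."""
    return [zlib.crc32(b) for b in blocks], reduce(xor_blocks, blocks)

def decode(blocks, crcs, parity):
    """Use the CRCs to locate one erased block; rebuild it from parity."""
    bad = [i for i, b in enumerate(blocks) if zlib.crc32(b) != crcs[i]]
    if not bad:
        return blocks
    assert len(bad) == 1, "XOR parity corrects a single erased block"
    i = bad[0]
    others = [b for j, b in enumerate(blocks) if j != i]
    blocks[i] = reduce(xor_blocks, others + [parity])
    return blocks

data = [b"AAAAAAAAAAAAAAAA", b"BBBBBBBBBBBBBBBB", b"CCCCCCCCCCCCCCCC"]
crcs, parity = encode(data)
corrupted = [data[0], b"X" * 16, data[2]]     # sub-rank 1 fails
recovered = decode(corrupted, crcs, parity)
print(recovered[1] == b"BBBBBBBBBBBBBBBB")    # True
```

As the caption notes, a write-back under this scheme requires a read-modify-write to update the parity block consistently.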
8. Conclusion
In this paper, we present DGMS, a hardware-only solution to dynamically adapt memory access granularities.
Adapting the access granularity utilizes scarce bandwidth
more efficiently by dynamically balancing traffic and control overheads. DGMS uses a new data / ECC layout combined with spatial footprint prediction to remove the need
for software interaction and control. Taking software out
of the loop increases the utility of the adaptive granularity
concept as well as its potential impact.
In our experiments, DGMS improves the system
throughput of memory-intensive applications with low or
medium spatial locality by 31%, while reducing DRAM
power by 13% and DRAM traffic by 44%. DGMS generally matches the execution characteristics of traditional CGonly systems for applications with high spatial locality. The
dynamic granularity predictor is very accurate and consistently outperforms software-profiling based granularity decisions. The benefits of dynamic prediction over static profiling are more significant when considering DRAM traffic
and power.
We will explore memory scheduling algorithms that are
more suitable for mixed-granularity access and will investigate better global feedback mechanisms for choosing access
granularities. We also plan a more detailed design and evaluation of strong chipkill-correct schemes that build on the
initial proposal discussed in the previous section.
Finally, while we evaluate dynamic granularity in the
context of main memory, we believe that DGMS can be applied to many other systems where interface bandwidth is
constrained. For example, DGMS can be particularly useful for memory architectures such as disaggregated memory [21], Violin memory [36], and PCIe-attached phase-change memory [6], all of which have a relatively low-bandwidth interface.
9. Acknowledgments
This work is supported, in part, by the following organizations: the National Science Foundation under Grant #0954107, the Intel Labs University Research Office for the Memory Hierarchy Innovation program, and the Texas Advanced Computing Center.
References
[1] HPCS scalable synthetic compact application (SSCA). http://www.highproductivity.org/SSCABmks.htm.
[2] Calculating memory system power for DDR3. Technical Report TN-41-01, Micron Technology, 2007.
[3] D. Abts, A. Bataineh, S. Scott, G. Faanes, J. Schwarzmeier,
E. Lundberg, M. Byte, and G. Schwoerer. The Cray Black
Widow: A highly scalable vector multiprocessor. In Proc. the
Int’l Conf. High Performance Computing, Networking, Storage, and Analysis (SC), Nov. 2007.
[4] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S.
Schreiber. Future scaling of processor-memory interfaces.
In Proc. the Int’l Conf. High Performance Computing, Networking, Storage and Analysis (SC), Nov. 2009.
[5] J. H. Ahn, J. Leverich, R. Schreiber, and N. P. Jouppi. Multicore DIMM: An energy efficient memory module with independently controlled DRAMs. IEEE Computer Architecture
Letters, 8(1):5–8, Jan. - Jun. 2009.
[6] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta, and
S. Swanson. Onyx: A prototype phase-change memory storage array. In Proc. the 3rd USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), Jun. 2011.
[7] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC
benchmark suite: Characterization and architectural implications. Technical Report TR-811-08, Princeton Univ., Jan.
2008.
[8] T. M. Brewer. Instruction set innovations for the Convey HC-1 computer. IEEE Micro, 30(2):70–79, 2010.
[9] M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. Technical Report TR-483-95,
Princeton University, 1995.
[10] C. Chen, S.-H. Yang, B. Falsafi, and A. Moshovos. Accurate
and complexity-effective spatial pattern prediction. In Proc.
the 10th Int’l Symp. High-Performance Computer Architecture (HPCA), Feb. 2004.
[11] T. J. Dell. A white paper on the benefits of chipkill-correct
ECC for PC server main memory. IBM Microelectronics Division, Nov. 1997.
[12] Earl Joseph II. GUPS (giga-updates per second) benchmark. http://www.dgate.org/~brg/files/dis/gups/.
[13] S. Eyerman and L. Eeckhout. System-level performance
metrics for multiprogram workloads. IEEE Micro, 28(3):42–
53, 2008.
[14] X. Fan, W.-D. Weber, and L. A. Barroso. Power provisioning
for a warehouse-sized computer. In Proc. the 34th Ann. Int’l
Symp. Computer Architecture (ISCA), Jun. 2007.
[15] A. Gonzalez, C. Aliagas, and M. Valero. A data cache with
multiple caching strategies tuned to different types of locality.
In Proc. the Int’l Conf. Supercomputing (ICS), Jul. 1995.
[16] G. Hamerly, E. Perelman, J. Lau, and B. Calder. SimPoint
3.0: Faster and more flexible program analysis. In Proc.
the Workshop on Modeling, Benchmarking and Simulation
(MoBS), Jun. 2005.
[17] J. Huh, D. Burger, and S. Keckler. Exploring the design space of future CMPs. In Proc. the Int'l Conf. Parallel Architectures and Compilation Techniques (PACT), pages 199–210, 2001.
[18] M. K. Jeong, D. H. Yoon, and M. Erez. DrSim: A platform for flexible DRAM system research. http://lph.ece.utexas.edu/public/DrSim.
[19] S. Kumar and C. Wilkerson. Exploiting spatial locality in
data caches using spatial footprints. In Proc. the 25th Ann.
Int’l Symp. Computer Architecture (ISCA), Jun. 1998.
[20] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M.
Tullsen, and N. P. Jouppi. McPAT: An integrated power, area,
and timing modeling framework for multicore and manycore
architectures. In Proc. the 42nd Ann. IEEE/ACM Int’l Symp
Microarchitecture (MICRO), Dec. 2009.
[21] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch. Disaggregated memory for expansion and sharing in blade servers. In Proc. the 36th Int’l Symp.
Computer Architecture (ISCA), Jun. 2009.
[22] J. S. Liptay. Structural aspects of the system/360 model 85,
part II: The cache. IBM Systems Journal, 7:15–21, 1968.
[23] G. H. Loh, S. Subramaniam, and Y. Xie. Zesto: A cycle-level
simulator for highly detailed microarchitecture exploration.
In Proc. the Int’l Symp. Performance Analysis of Software and
Systems (ISPASS), Apr. 2009.
[24] J. D. McCalpin. STREAM: Sustainable memory bandwidth in high performance computers. http://www.cs.virginia.edu/stream/.
[25] Micron Corp. Micron 1 Gb ×4, ×8, ×16, DDR3 SDRAM:
MT41J256M4, MT41J128M8, and MT41J64M16, 2006.
[26] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi.
CACTI 6.0: A tool to model large caches. Technical Report
HPL-2009-85, HP Labs, Apr. 2009.
[27] R. C. Murphy and P. M. Kogge. On the memory access patterns of supercomputer applications: Benchmark selection and its implications. IEEE Transactions on Computers,
56(7):937–945, Jul. 2007.
[28] D. A. Patterson, G. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proc. the ACM
SIGMOD International Conference on Management of data,
Jun. 1988.
[29] M. K. Qureshi, M. A. Suleman, and Y. N. Patt. Line distillation: Increasing cache capacity by filtering unused words in
cache lines. In Proc. the 13th Int’l Symp. High Performance
Computer Architecture (HPCA), Feb. 2007.
[30] S. Rixner, W. J. Dally, U. J. Kapasi, P. R. Mattson, and J. D.
Owens. Memory access scheduling. In Proc. the 27th Ann.
Int’l Symp. Computer Architecture (ISCA), Jun. 2000.
[31] J. B. Rothman and A. J. Smith. The pool of subsectors cache
design. In Proc. the 13th Int’l Conf. Supercomputing (ICS),
Jun. 1999.
[32] A. Seznec. Decoupled sectored caches: Conciliating low tag implementation cost and low miss ratio. In Proc. the 21st Ann. Int'l Symp. Computer Architecture (ISCA), Apr. 1994.
[33] C. Slayman. Impact and mitigation of DRAM and SRAM soft errors. IEEE SCV Reliability Seminar, http://www.ewh.ieee.org/r6/scv/rl/articles/Soft%20Error%20mitigation.pdf, May 2010.
[34] S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and
A. Moshovos. Spatial memory streaming. In Proc. the 33rd
Ann. Int’l Symp. Computer Architecture (ISCA), Jun. 2006.
[35] Standard Performance Evaluation Corporation. SPEC CPU
2006. http://www.spec.org/cpu2006/, 2006.
[36] Violin Memory Inc. Scalable memory appliance. http://violin-memory.com/DRAM.
[37] Z. Wang, G. A. Jullien, and W. C. Miller. An efficient tree architecture for modulo 2^n + 1 multiplication. Journal of VLSI Signal Processing, 14:241–248, Dec. 1996.
[38] F. A. Ware and C. Hampel. Improving power and data efficiency with threaded memory modules. In Proc. the Int’l
Conf. Computer Design (ICCD), Oct. 2006.
[39] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta.
The SPLASH-2 programs: Characterization and methodological considerations. In Proc. the 22nd Ann. Int’l Symp. Computer Architecture (ISCA), Jun. 1995.
[40] D. H. Yoon and M. Erez. Virtualized and flexible ECC for
main memory. In Proc. the 15th Int’l. Conf. Architectural
Support for Programming Languages and Operating Systems
(ASPLOS), Mar. 2010.
[41] D. H. Yoon, M. K. Jeong, and M. Erez. Adaptive granularity memory systems: A tradeoff between storage efficiency
and throughput. In Proc. the 38th Ann. Int’l Symp. Computer
Architecture (ISCA), 2011.
[42] Z. Zhang, Z. Zhu, and X. Zhang. A permutation-based page
interleaving scheme to reduce row-buffer conflicts and exploit
data locality. In Proc. the 33rd IEEE/ACM Int’l Symp. Microarchitecture (MICRO), Dec. 2000.
[43] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and
Z. Zhu. Mini-rank: Adaptive DRAM architecture for improving memory power efficiency. In Proc. the 41st IEEE/ACM
Int’l Symp. Microarchitecture (MICRO), Nov. 2008.