Multi-Layer Memory Resiliency
Invited Paper in Special Session “Embedded Resiliency: Approaches for the Next Decade”
Nikil Dutt1, Puneet Gupta2, Alex Nicolau1,
Abbas BanaiyanMofrad1, Mark Gottscho2, Majid Shoushtari1
1
Department of Computer Science
University of California, Irvine
Irvine, CA 92697
{dutt,nicolau,abanaiya,anamakis}@uci.edu
ABSTRACT
With memories continuing to dominate the area, power, cost and
performance of a design, there is a critical need to provision
reliable, high-performance memory bandwidth for emerging
applications. Memories are susceptible to degradation and
failures from a wide range of manufacturing, operational and
environmental effects, requiring a multi-layer hardware/software
approach that can tolerate, adapt and even opportunistically
exploit such effects. The overall memory hierarchy is also highly
vulnerable to the adverse effects of variability and operational
stress. After reviewing the major memory degradation and failure
modes, this paper describes the challenges for dependability
across the memory hierarchy, and outlines research efforts to
achieve multi-layer memory resilience using a hardware/software
approach. Two specific exemplars are used to illustrate multilayer memory resilience: first we describe static and dynamic
policies to achieve energy savings in caches using aggressive
voltage scaling combined with disabling faulty blocks; and second
we show how software characteristics can be exposed to the
architecture in order to mitigate the aging of large register files in
GPGPUs. These approaches can further benefit from semantic
retention of application intent to enhance memory dependability
across multiple abstraction levels, including applications,
compilers, run-time systems, and hardware platforms.
1. INTRODUCTION
The advent of many-core computing platforms exacerbates the
classical
processor-memory
performance
bottleneck.
Traditionally, memory hierarchies have attempted to address this
performance bottleneck by keeping frequently accessed data close
to where they are consumed (e.g., by caching). However,
contemporary design processes also need to guarantee other nonfunctional constraints such as power, energy and thermal bounds.
Furthermore, since memories occupy a significant percentage of a
chip’s area, the memory subsystem has become vulnerable to a
host of manufacturing, environmental, and operational
failure/degradation mechanisms that affect the overall resiliency
of the system. This paper outlines memory resilience challenges
and opportunities across and between multiple levels of
abstraction in a typical hardware/software design flow for
computing systems (see Figure 1). The overall discussion is
focused on systems-on-chip (SoCs), although similar analyses can
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. Copyrights for
components of this work owned by others than the author(s) must be
honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee. Request permissions from
Permissions@acm.org.
DAC ’14, June 01 - 05 2014, San Francisco, CA, USA
Copyright 2014 ACM 978-1-4503-2730-5/14/06$15.00.
http://dx.doi.org/10.1145/2593069.2596684
2
Department of Electrical Engineering
University of California, Los Angeles
Los Angeles, CA 90095
puneet@ee.ucla.edu, mgottscho@ucla.edu
be made for large-scale distributed systems as well. Section 2
describes memory abstractions across the design hierarchy shown
in Figure 1, the typical causes of memory errors, and error
manifestations at each abstraction level. Sections 3 and 4 use
memory voltage scaling and wearout, respectively, as exemplars
for multi-layer memory resiliency approaches. Section 5 outlines
challenges for managing manufacturing variability and describes
memory-related efforts within the NSF Variability Expedition
project that aims to opportunistically exploit and manage
hardware variability through software mechanisms. Section 6
closes with the outlook for multi-level memory resilience.
System
Abstraction
Memory
Abstraction
Application
Program,
Data Structures,
Files,
Libraries
Operating
System
Main Memory,
File System
Address Space,
Heap, Stack
ISA/RTL/
Arch
Buffers,
Register File,
L1$, L2$, SPM
Logic
Memory Cells,
Bit Arrays
Circuit/
Device
Voltage,
Current,
Transistor,
Cell
Error
Manifestation
ERROR!
0
10010100
Process
variation
Opportunity
Incorrect Output,
Infinite Loop,
Crash
Trade
Performance or
Accuracy for
Energy Saving
Wrong Pointer,
Erroneous System
Call, Trap
Exploit Memory
Mapping for
Reliable vs
Unreliable Pages
Faulty Word,
Cache Block,
Way
Approximate
Data Storage
Bit flip,
Stuck @ 0/1
Operate at
Lower Precision
Low Noise
Margin,
Unstable Cell,
Vth Variation
Relax Hardware
Guardbands
Figure 1. Memory Abstractions, Errors, and Opportunities.
This paper is part of the DAC special session on “Embedded
Resiliency: Approaches for the Next Decade”. Other papers in this
session are: “Monitoring Reliability in Embedded Processors – A
Multi-layer View” [68], “Multi-Layer Dependability: From
Microarchitecture to Application Level” [69], and “Workloadand Instruction-Aware Timing Analysis – The missing Link
between Technology and System-level Resilience” [70].
2. MEMORIES AND ERRORS
Figure 1 shows the typical hardware/software abstraction layers
for computing systems. Each row of Figure 1 describes the
system abstraction layer, the memory abstraction at that level, and
typical manifestations of memory errors that can compromise
system resiliency. The last column of Figure 1 describes
opportunities for relaxed and approximate computing in the face
of memory error manifestations at that level of abstraction.
Memory errors manifest themselves in different ways across
abstraction stack. For instance, an unstable memory cell at the
circuit/device level can cause a bit failure at the memory logic
level, which in turn might propagate up the abstraction stack as a
faulty memory access at the architecture level, a wrong function
call or system halt at OS-level, and finally an output error or an
exception at application layer.
Figure 1 represents a symbolic abstraction of memory errors over
the entire hardware/software system stack. Traditionally, memory
resilience has been addressed via disparate techniques at each
level of design abstraction, while newer efforts attempt to couple
strategies across layers with the goal of improving system
efficiency for energy, heat dissipation, lifetime, cost, etc.
Furthermore, efforts in relaxed and approximate computing
attempt to create designs that can trade off application quality for
these system efficiency goals.
To understand memory faults, we can classify them by their
temporal behaviors (persistence) as well as their causes. With
respect to persistence, a memory fault can be permanent or
transient. Permanent faults persist indefinitely in the system after
occurrence, while transient faults manifest for a relatively short
period of time after occurrence. Furthermore, causes of memory
faults can be hard or soft. Hard faults are static and caused by
device failure or wear-out failure. In contrast, soft faults are
dynamic and are typically caused by the operating environment.
Memories suffer from different sources of unreliability that can be
classified into three main groups:
•
•
•
Manufacturing. Worsening manufacturing imperfections in
nanoscale technologies result in increasing variability of
device and circuit-level parameters. This process variation
particularly affects transistor threshold voltages through
random dopant fluctuation (RDF), increasing the likelihood
of memory cells failing permanently due to insufficient noise
margins at a given supply voltage.
Environmental. Alpha particle radiation coming from the
operating environment can cause single event upsets (SEU).
Combined with weakened noise margins from manufacturing
effects, memory cells are also becoming more susceptible to
SEU, impacting their soft error resilience [1]. Noise
stemming from variations in the supply voltage and thermal
effects can also cause memory faults exhibiting dynamic and
random behavior.
Aging and Wearout. Depending on the type of technology
used, memory cells can age, reducing their performance, data
retention capability, and power consumption. Aging can lead
to memory wearout, resulting in permanent faults.
Different memory technologies suffer from various sources of
unreliability. Volatile memories such as SRAM and DRAM
mostly suffer from manufacturing defects and environmental
issues that lead to hard and soft errors, respectively. Endurance is
not an issue in SRAM and DRAM. In contrast, different nonvolatile memories (NVMs) have their own sources of
unreliability. For flash and phase change memory (PCM), wearout
is the primary source of unreliability due to limited write
endurance. PCMs also suffer from hard and soft errors [2]. Other
emerging NVMs such as MRAM and its newer cousin STT-RAM
also suffer from hard and soft errors. However, for these devices,
wearout is not as great of a reliability threat, because they have
large write endurances similar to that of SRAM.
The design of reliable computer systems has a rich history
spanning several decades: variants of spatial, temporal, and
information redundancy have been exploited to improve
reliability. Memory systems also deploy these forms of
redundancy to achieve resilience across various layers of system
abstraction. Additionally, memory designers have leveraged a
variety of other memory-specific techniques.
Here, we provide a sampling of common techniques used for
reliable memory design at the architectural level. A significant
body of research exists on the design of a reliable memory
hierarchy comprising multiple levels of caches and main memory.
Fault-tolerant memory designs have often used simple techniques
such as adding redundant rows/columns to the memory array [18]
or applying memory down-sizing techniques by disabling a faulty
row or cache line (block) [20].
Information redundancy via error coding is also commonly used
to improve the reliability of memory components. Wide ranges of
error detection and correction codes (EDC and ECC, respectively)
have been used [7]. Typically, EDCs are simple parity codes,
while the most common ECCs use Hamming [8] or Hsiao [9]
codes. ECC is proven as an effective mechanism for handling soft
errors. For NVMs that have limited write endurance, various
wear-leveling approaches have been proposed to mitigate aging
and extend memory lifetime.
For many embedded applications, hardware controlled caches do
not provide predictable performance and can also be energy
inefficient. Consequently, caches are increasingly replaced by or
augmented with software-controlled scratchpad memories
(SPMs). The design of reliable SPMs has also received great
attention recently, including efforts that address the reliability of
SPMs for chip-multiprocessors (E-RoC [15] and SPMVisor [16]),
or for hybrid memories (FTSPM [17]).
Surprisingly, very little work has attempted to leverage higherlevel semantic retention [67] to assist at all levels of unreliability.
Indeed, by having a “big-picture” understanding of what data
structures/parts-thereof are accessed, how frequently, and in what
way during a program phase, and relating these to the fault
profiles of the underlying memory subsystems, one could improve
the efficiency of (or even eliminate the need for) recovery
mechanisms in both hardware and software.
An exhaustive survey of memory resilience is beyond the scope of
this paper. However, in the next two sections we present two
recent research topics – resilient caches and memory aging – as
vehicles to illustrate opportunities for multi and cross-layer
memory resilience. For each case, we briefly explain ongoing
efforts and highlight an exemplar study that leverages a multilayer approach toward improving memory resilience.
3. RESILIENT CACHES
We can categorize resilient SRAM cache design efforts into three
main groups. Many of these have the common property of “faulttolerant voltage-scalable” (FTVS) design, because low voltage
operation – while critical for achieving power and energy savings
– is the primary driver behind unreliable memories. In general,
regardless of whether the fault-tolerant design is done at the cell,
circuit, coding, or architecture level, there is a tradeoff in terms of
memory capacity and area. This may be due to larger memory
cells, spare or redundant cells, error correction logic, or a reduced
amount of reliable memory available for use by the application.
3.1 Cell and Circuit-Level Techniques
The root of most SRAM reliability problems is the cell noise
margin. At low supply voltages, noise margins are reduced,
increasing susceptibility to data corruption caused by
environmental factors described earlier. Furthermore, variability
in cell noise margins requires a statistical approach to designing a
reliable memory array and choice of minimum supply voltage,
which must be increased to maintain yield under large variations.
Single error correction double error detection (SECDED) is a
widely used coding technique for protecting memory structures
against soft errors. When greater error detection is necessary,
more complex multi-bit error correction schemes have also been
proposed. Double error correction triple error detection
(DECDED), two-dimensional ECC (2D-ECC) [10], multiple-bit
segmented ECC (MS-ECC) [11], Hi-ECC [12], variable-strength
ECC (VS-ECC) [13], and Memory Mapped ECC [14] are some of
the more notable schemes. Besides common codes such as
Hamming [8] and Hsiao [9], other strong codes such as BCH [12],
OLSC [11], and Reed Solomon [7] have also been used to gain
strong error detection. However, ECC techniques generally come
at high cost due to significant memory storage and logic
overheads. Despite this, ECC remains a popular method for
memory resilience due to its effectiveness against soft errors, and
the lack of involvement from other layers of abstraction.
3.3 Architecture-Level Techniques
Many architecture-level schemes deploy redundancy or capacity
downsizing techniques to improve the reliability of cache
memories. Earlier works on fault-tolerant cache design use simple
techniques by adding redundant rows/columns to the cache [18] or
disabling faulty cache block, sets, and/or ways [20]. Similarly,
Wilkerson et al. [21] proposed multiple techniques using part of a
cache line as redundancy for defective bits for the rest of cache
lines in the same set. PADed cache [19] and Agarwal’s design [1]
program column multiplexer and address decoders to select nonfaulty blocks, respectively.
Other efforts, such as In-Cache Replication (ICR) [23] and MultiCopy Cache (MC2) [22], use data replication to improve
reliability. Schemes such as Replication Cache [24] and
ZerehCache [25] use external spare caches. Similarly, variants of
fault-grouping and fault remapping have been used to tolerate
faulty cache blocks without adding any spare elements, but by
using other parts of the cache, such as GRP2 [26], RDC-Cache
[27], Abella [28], Archipelago [29], and FFT-Cache [36].
Wilkerson’s scheme [21] also could fall under this category.
In all the above schemes, algorithmic and compiler semantic
retention could help enhance the efficiency of the proposed
mechanisms, by facilitating more accurate remapping, accurate
(more limited) replication, and/or more efficient relocation
approaches. Some hybrid schemes combine multiple techniques
mentioned earlier to minimize the costs of memory protection.
Zhou [30] minimizes area overhead through joint optimization of
cell size, redundancy, and ECC; and Ndai [31] performs circuitarchitecture codesign for memory yield improvement.
More recent architectural schemes for cache resilience address
newer challenges for multi and many-core platforms, such as
scalability [32][59], variation in fault behaviors [11], non-uniform
memory access latency [59], limited shared redundancy [33], lowoverhead multi-VDD support [37], and high costs of uniform
design [34][35].
3.4 Power/Capacity Scaling
We now turn to our most recent work [37] as an exemplar for
cross-layer resilient cache design. Many works in resilient SRAM
Thus, we proposed in [37] a better metric for evaluating FTVS
SRAM caches: power versus effective capacity. For example, one
can consider an ECC-based cache as either having a power
overhead for a given amount of bit storage, or for a given amount
of power, fewer bits that are usable to store data. These sorts of
tradeoffs are captured appropriately by this metric, and enable
more effective cross-layer design.
We realized that employing sophisticated ECC, block-level
redundancy or address remapping can achieve very low supply
voltages, but not the best design tradeoff in power vs. capacity.
When voltage scaling an SRAM array, there is a critical point
where the memory becomes virtually useless due to very high bit
error rates. Fault tolerance mechanisms allow incrementally lower
voltages, but at ever-increasing costs in area, power, performance,
and complexity. Thus, it appears that tolerating many errors for
low voltage operation can quickly become a fool’s errand.
In [37] this realization led us to come up with a simple FTVS
SRAM cache architecture for energy savings. The idea is to
achieve a better power/capacity tradeoff for a cache by using
ultra-lightweight fault tolerance that gracefully degrades cache
utility as voltage is lowered. Essentially, an offline or built-inself-test (BIST) routine identifies blocks that have any faulty bits
at each pre-determined VDD level. Using the so-called fault
inclusion property [37], we keep a very small fault map (1-2 bits
per block) in the tag array, which is not voltage scaled. At any
given runtime voltage, the fault map directly controls power gate
transistors which disable blocks that are unreliable for further
power savings. Meanwhile, the cache controller prohibits valid
data from being placed in a faulty block. From the software’s
perspective, the cache capacity is reduced at low voltage, causing
more misses, but otherwise the cache operates correctly with good
power savings. However, the yield could be affected since each
set requires at least one non-faulty block at all runtime voltages.
Normalized
Static Power
3.2 Error Coding Techniques
caches target power reduction by enabling low voltage operation.
As described earlier in Section 2, low voltage operation results in
higher probability of faulty memory cells, thus requiring some
form of fault tolerance. Thus, there is a tradeoff between power
(as it depends on supply voltage) and fault tolerance overheads (in
terms of area, performance, and power). Despite this, most faulttolerant voltage-scalable (FTVS) SRAM cache designs emphasize
the metric of minimum achievable VDD at fixed yield. This can
be misleading when judging the efficacy of such an approach.
1.2
1.0
0.8
0.6
0.4
0.2
0.0
FFT-Cache
Way-Based Power Gating (Generic)
Proposed Power/Capacity Scaling
0.0
0.2
0.4
0.6
Proportion of Usable Blocks
0.8
1.0
Figure 2. Static power vs. effective capacity for three different
SRAM cache resizing approaches [37].
Yield
Engineers have designed larger memory cells using more
transistors and/or larger transistors to increase mean noise margins
and/or reduce margin variability, but these come at the cost of
reduced area efficiency and sometimes power. Several of these
circuit-level techniques include 8T [3][4], 10T [5], and Schmidt
Trigger (ST) [6] SRAM cells.
1.0
0.8
0.6
0.4
0.2
0.0
Conventional
SECDED
DECTED
FFT-Cache
Proposed PCS
0.3
0.4
0.5
0.6
0.7
0.8
Data Array Cell VDD (V)
0.9
1
Figure 3. Yield vs. VDD for several different fault-tolerant
voltage-scalable (FTVS) SRAM cache approaches [37].
Figure 2 illustrates the benefit of our power/capacity scaling
approach compared with power gating and FFT-Cache [36] (one
of our recent FTVS works), for trading off power and capacity.
This is despite the inability of the proposed power/capacity
scaling method to achieve the lowest voltage at any yield target
(Figure 3), motivating further studies in this direction.
We proposed in [37] two policy variants of power/capacity
scaling: static (SPCS) and dynamic (DPCS). SPCS allows the
system software or cache controller to choose the optimal cache
voltage at boot time, based on knowledge of faulty blocks gained
through BIST, to achieve a minimum of 99% fault-free blocks.
While SPCS is simple and can greatly reduce the voltage
guardband, it ignores the opportunity for even better energy
savings through cross-layer hardware/software optimization.
DPCS allows the system to adapt the cache VDD at runtime in
response to varying workload behaviors. In [37] we had the cache
controller adapt the voltage in response to changing miss rates and
an estimate of the miss penalty. When many misses were
encountered at low voltage, the controller raises VDD to make
more blocks available for use and thereby reduces capacity and
conflict misses. When few misses are encountered, the controller
reduces VDD to opportunistically save power.
Higher level semantics can mitigate the effect of the reduced
cache size on performance (e.g., by simply increasing power –
and hardware reliability - in phases of execution where the cache
is fully utilized) or more interestingly, by using the higher-level
information to adapt the organization/utilization of the data so as
to minimize misses given the faulty-cache configuration. More
sophisticated cross-layer policies are part of our ongoing work.
With knowledge of the power/capacity scaling mechanism and
particular cache operating points, software could be optimized at
compile-time or runtime to improve energy efficiency with
minimal performance degradation.
4. MEMORY AGING AND WEAROUT
We now review sample efforts that cope with wearout in
memories and their limited lifetime at different levels of
abstraction. As with resilient caches, higher-level semantic
retention can help, by using information about how different
program and algorithm-level structures are utilized (frequency of
access, of reads of writes, their mappings at bank or cache level,
etc. in different phases of program execution) to both increase
efficiency of execution in the presence of faults, and to alleviate
the expense of recovery mechanisms in software or hardware. We
also illustrate how program characteristics can be exposed to the
hardware in order to mitigate wearout effects, using the example
of large GPGPU register files.
4.1 Wearout Mechanisms and Their Effects
Wearout mechanisms are different depending on the type of the
memory family. While electron tunneling degrades the oxide layer
in flash memory cells, SRAM is threatened by negative-bias
temperature instability (NBTI) which weakens the drive current of
PMOS devices. Furthermore, wearout effects are also different for
each memory type. Wearout in NVMs limits the number of
reliable writes. In SRAM, it decreases the stability of cells,
especially for the read operation. Although wearout in NVMs is
typically irreversible, SRAM wearout is partially recoverable.
4.2 Improving NVM Write Endurance
Traditional memory management techniques are write-variation
oblivious and therefore cause parts of the memory to reach its
maximum write count much earlier than the rest. Thus, most
approaches for enhancing write endurance of NVMs are based on
two ideas: (1) uniformly distributing writes over the whole
memory space, and (2) reducing the number of write operations.
4.2.1 Flash as Main Memory
Approaches for wear-leveling in flash memories fall into two
categories. First, dynamic wear leveling (DWL) techniques look
at all of the available blocks that are free and select the one with
the lowest erase count for next write. However, they do not move
cold data afterwards [38]. Second, static wear leveling (SWL)
techniques try to prevent cold data from staying at any block for a
long period of time. If the difference between two blocks’ erase
counts is too large, SWL starts erasing young blocks by moving
cold data away from them [39].
4.2.2 PCM as Main Memory
Architectural Level Solutions: Flip-N-Write [40] is a microarchitectural technique that performs a read-before-write to decide
whether to write the original data or its flipped version depending
on which causes fewer bit flips. This is transparent to the rest of
the system and the memory device takes care of inverting data
whenever required. The authors in [41] consider manufacturing
variation, which causes the programming current to be adjusted,
based on the most difficult-to-reset cell. Instead of sacrificing
lifetime of other cells, they use a lower programming current
through Fine-Grained Current Regulation, allowing difficult-toreset cells to be recovered by error correcting pointers (ECP).
OS Level Solutions: Dhiman et al. propose PDRAM [42] for
hybrid PCM and DRAM memories. The operating system’s page
manager uses the page-level access frequency of PCM pages,
tracked by hardware, in order to perform wear leveling. The OS
also tries to swap hot pages from PRAM to DRAM. By changing
the memory controller, the TLBs, and the operating system, the
authors of [43] dynamically form clean pages out of pages with
faulty bits. This enables continued operation through graceful
degradation when cells fail.
Application Level Solutions: A recent work by Sampson et al.
[44] offers a new perspective for improving PCM lifetime.
Through annotations, the application developer can identify some
program variables as candidate for approximate storage. Hardware
exploits this by reducing number of programming pulses for that
part of physical memory that holds this data. In addition, even
failed cells are used for storing approximate data.
4.2.3 PCM as On-Chip SPM
HaVOC [66] uses a combination of programmer annotations and a
data volatility metric to simultaneously save energy while
increasing the lifetime of NVMs. The volatility metric measures
write frequency of a piece of data over its accumulated lifetime.
Variable annotations are used to pass this metric to the run-time
system, allowing the SPM manager to prioritize mapping of data
with higher write frequency to be put in on-chip SPM. Thus by
reducing the write operations to NVM, not only is the energy
consumption of SPM reduced, but also its life-time is increased.
4.2.4 ReRAM as On-Chip Last-Level Cache
[45] proposes inter/intra-set write variation-aware cache policy
(i2WAP) for ReRAM caches. Using address remapping, it
uniformly distributes cache writes between all of the cache sets.
This solves the problem of inter-set variation but within a set, hot
cache lines are accessed more frequently because of locality. To
solve this, i2WAP slightly modifies the Least Recently Used
(LRU) replacement policy by intelligently writing back hot data at
some timestamp and invalidating the corresponding line. The
invalidated line would be a candidate for the next replacement,
possibly for cold data.
4.3 Mitigating SRAM Aging
4.3.1 Architectural Level Solutions for SRAM Caches
[46] proposes Dynamic Indexing for SRAM caches. The authors
observe that in a partitioned cache architecture, some of the
partitions are idle during most of the application execution time,
while some others are accessed more. They exploit this behavior
by putting idle partitions in drowsy mode (i.e., drooped VDD).
This slows down the wearout of SRAM cells in those partitions.
Also the cache indexing function is changed over time in order to
uniformly distribute the idleness over all of the partitions.
4.3.2 Software Level Solution for SRAM SPMs
[47] presents a library of C-functions for wearout-aware data
allocation on physically-banked SPMs. For data allocation,
SPM_malloc calls the SPM controller which is aware of the
current wearout status of each bank. This controller distributes
allocation requests over the SPM banks in such a way that all
banks could spend the same amount of time in drowsy mode.
4.4 Register File Aging Case Study: ARGO
Extreme multithreading with fast thread switching in GPGPUs is
supported by large register files (RFs) that are much larger than
on-chip caches holding the execution state of each thread. To
protect these register files against NBTI, ARGO [48] exposes
program characteristics to the hardware in order to design a lowoverhead stress distributer. In ARGO’s flow (Figure 4), the
OpenCL compiler embeds some metadata in binary code,
including number of required registers for that kernel and its
maximum amount of required memory. Based on this information,
the host CPU at runtime decides on how many threads to assign to
each workgroup. Depending on the kernel requirements and
resource limitations not all of the available register file space can
be used. On average, 46% underutilization is observed for
execution of 15 common general purpose kernels. In such a flow,
the compiler helps the underlying hardware by letting it know
how much of the register space is required by the software. The
RF allocator then power-gates unused parts of the register file,
thereby not only saving leakage power, but more importantly
ameliorating aging by putting that part in NBTI recovery mode.
Furthermore the RF allocator employs a virtual sensing approach
to estimate the aging profile of different RF banks in a relative
manner. Based on that, and without any need of having on-chip
NBTI sensors, it circulates the allocated space in the entire
physical space of RF over time to enhance the RF lifetime.
Binary Code
Compiler
Source Code
(OpenCL)
Metadata:
Reg. # Mem
Ultra-threaded Dispatcher
Software Side
Hardware Side
CU i
RF Allocator
X
Y
Z
W
Power Gaters
...... .
Figure 4. ARGO Overview.
5. VARIABILITY EXPEDITION
Variability in computer systems can stem from semiconductor
manufacturing, ambient operating conditions, wearout over time,
and differing vendors. The NSF Variability Expedition (Figure 5)
[49] seeks to build opportunistic computing systems where
hardware variations are monitored and exposed to software layers
(instead of being hidden behind pessimistic margins) enabling
adaptations. The work has spanned circuit-level monitoring and
test (e.g., [50], [51], [60]), variability emulation ([52], [53]),
runtime for embedded systems (e.g., [54], [55]), GPUs (e.g., [56],
[48]), processors (e.g., [56], [58]), memories (e.g., [55], [64],
[59]) and storage (e.g., [61], [62]). In the following, we briefly
describe some of the research on memory variability done under
the Variability Expedition.
DRAMs were observed to have over 20% read/write power
variation [63] which was leveraged in [64] by dynamically
adapting virtual to physical address mapping in the Linux
operating system. The approach preferentially allocates frequently
accessed data on to lower power memory DIMMs.
SRAM arrays are known to have large variations which limit their
minimum operating voltage and hence power. [15] achieves
reliability through redundancy by optimizing RAID-like policies
tuned for on-chip distributed scratchpad memories at lower power
cost than ECC with voltage overscaling. Extending this, [55]
allows programmers to partition their application’s address space
(through annotations) into virtual address regions and create
mapping policies for each region depending on their requirements
(fault tolerance, power, etc). In the cache context, FFT-Cache [36]
uses sophisticated fault tolerance schemes in cache organization to
achieve a lower operating voltage, while [37] described earlier
does this using simple fault tolerance mechanisms for lower
overheads.
Figure 5. The Underdesigned and Opportunistic Computing
vision of the NSF Variability Expedition [49].
Measurements show systematic variation in program latency
within and across multi-level flash devices [65]. [62] extends
conventional flash translation layers to schedule flash program
operations on pages based on operations performance
requirements and specific pages’ performance characteristics.
Based on the observation that, for multi-level cell flash, whenever
a cell error occurs, with high probability only one bit in the cell
has error, [61] proposed an error correcting code based on
generalized tensor products.
The increasing fraction of memory real estate coupled with
emerging memory technologies with varying variability
mechanisms make architecture and software-level handling of
memory variations an integral part of the Variability Expedition.
6. SUMMARY AND CONCLUSIONS
In this paper, we highlighted efforts and opportunities for
achieving memory resiliency both within and across multiple
layers of the abstraction stack. To enable cross-layer memory
resilience, it is important to understand the abstractions of
memories, manifestations of memory errors and memory
vulnerability at multiple levels. Our paper gave a sampling of
these memory issues within the context of complex SoC designs.
We also used two exemplars (resilient caches and memory aging)
to illustrate multi-layer strategies for enhancing resilience.
While traditional memory resilience efforts have focused
primarily on hardware, it is increasingly important to develop
software-enabled mechanisms for managing memory resilience.
Moving forward, we should see efforts that synergistically
combine hardware and software approaches to overcome adverse
effects of memory failures, and also which opportunistically
exploit application semantics for achieving more efficient designs,
particularly in the context of applications that tolerate some level
of quality degradation (e.g., approximate computing). System
designers will need abstractions, tools, and methods to enable
effective exploration of the memory resiliency design space.
7. ACKNOWLEDGMENTS
This work was partially supported by NSF Variability Expedition
Grant Numbers CCF-1029783 and CCF-1029030.
8. REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
A. Agarwal et al. A process-tolerant cache architecture for improved yield in
nanoscale technologies. IEEE TVLSI, 2005.
D. H. Yoon et al. FREE-p: Protecting non-volatile memory against both hard
and soft errors. Proc. HPCA, 2011.
Y. Morita et al. An area-conscious low voltage-oriented 8T-SRAM design
under DVS environment. Proc. Symp. on VLSI Circuits, 2007.
N. Verma and A. Chandrakasan. A 256 Kb 65 nm 8T subthreshold SRAM
employing sense-amplifier redundancy. IEEE JSSC, 2008.
B. Calhoun and A. Chandrakasan. A 256 Kb sub-threshold SRAM in 65nm
CMOS. Proc. ISSCC, 2006.
J. P. Kulkarni et al. A 160 mV, fully differential, robust schmitt trigger based
sub-threshold SRAM. Proc. ISLPED, 2007.
S. Lin and D. J. Costello. Error control coding, second edition. Prentice-Hall,
Inc., 2004.
R. W. Hamming. Error correcting and error detecting codes. Bell System Tech.
Jour., 1950.
M. Y. Hsiao. A class of optimal minimum odd-weight-column SECDED codes.
IBM J. Research and Devel., 1970.
J. Kim et al. Multi-bit error tolerant caches using two-dimensional error
coding. Proc. MICRO, 2007.
Z. Chishti et al. Improving cache lifetime reliability at ultra-low voltages.
Proc. MICRO, 2009.
C. Wilkerson et al. Reducing cache power with low-cost, multi-bit errorcorrecting codes. Proc. ISCA, 2010.
A. Alameldeen et al. Energy-efficient cache design using variable-strength
error correcting codes. Proc. ISCA, 2011.
D. H. Yoon and M. Erez. Memory mapped ECC: low-cost error protection for
last level caches. Proc. ISCA, 2009.
L. Bathen and N. Dutt. E-RoC: embedded raids-on-chip for low power
distributed dynamically managed reliable memories. Proc. DATE, 2011.
L. Bathen et al. SPMVisor: dynamic scratchpad memory virtualization for
secure, low power, and high performance distributed on-chip memories. Proc.
CODES+ISSS, 2011.
A.M.H. Monazzah et al. FTSPM: a fault-tolerant scratchpad memory. Proc.
DSN, 2013.
S. Schuster. Multiple word/bit line redundancy for semiconductor memories.
IEEE JSCC, 1978.
P. Shirvani and E. McCluskey. PADded cache: a new fault tolerance technique
for cache memories. Proc. VTS, 1999.
S. Ozdemir et al. Yield-aware cache architectures. Proc. MICRO, 2006.
C. Wilkerson et al. Trading off cache capacity for reliability to enable low
voltage operation. Proc. ISCA, 2008.
A. Chakraborty et al. E < MC2: less energy through multi-copy cache. Proc.
CASES, 2010.
W. Zhang et al. ICR: in-cache replication for enhancing data cache reliability.
Proc. DSN, 2003.
W. Zhang. Replication cache: a small fully associative cache to improve data
cache reliability. IEEE TC, 2005.
A. Ansari et al. ZerehCache: armoring cache architectures in high defect
density technologies. Proc. MICRO, 2009.
D. Roberts et al. On-chip cache device scaling limits and effective fault repair
techniques in future nanoscale technology. Proc. DSD, 2007.
A. Sasan et al. A fault tolerant cache architecture for sub 500mV operation:
resizable data composer cache (RDC-cache). Proc. CASES, 2009.
J. Abella et al. Low VCC-min fault-tolerant cache with highly predictable
performance. Proc. MICRO, 2010.
A. Ansari et al. Archipelago: a polymorphic cache design for enabling robust
near-threshold operation. Proc. HPCA, 2011.
S.T. Zhou et al. Minimizing total area of low-voltage SRAM arrays through
joint optimization of cell size, redundancy, and ECC. Proc. ICCD, 2010.
View publication stats
[31] P. Ndai et al. A scalable circuit-architecture co-design to improve memory
yield for high-performance processors. IEEE TVLSI, 2010.
[32] A. BanaiyanMofrad et al. A novel NoC-based design for fault-tolerance of lastlevel caches in CMPs. Proc. CODES+ISSS, 2012.
[33] A. BanaiyanMofrad et al. Modeling and analysis of fault-tolerant distributed
memories for networks-on-chip. Proc. DATE, 2013.
[34] S. Paul et al. Reliability-driven ECC allocation for multiple bit error resilience
in processor cache. IEEE TC, 2011.
[35] P. Ampadu et al. Breaking the energy barrier in fault-tolerant caches for
multicore systems. Proc. DATE, 2013.
[36] A. BanaiyanMofrad et al. FFT-Cache: a flexible fault-tolerant cache
architecture for ultra low voltage operation. Proc. CASES, 2011.
[37] M. Gottscho et al. Power/capacity scaling: energy savings with simple faulttolerant caches. Proc. DAC, 2014.
[38] L. P. Chang. On efficient wear-leveling for large-scale flash-memory storage
systems. Proc. SAC, 2007.
[39] Y. H. Chang et al. Endurance enhancement of flash-memory storage systems:
an efficient static wear leveling design. Proc. DAC, 2007.
[40] S. Cho and H. Lee. Flip-N-Write: a simple deterministic technique to improve
PRAM write performance, energy, and endurance. Proc. MICRO, 2009.
[41] L. Jiang et al. Enhancing phase change memory lifetime through fine-grained
current regulation and voltage upscaling. Proc. ISLPED, 2011.
[42] G. Dhiman et al. PDRAM: a hybrid PRAM and DRAM main memory system.
Proc. DAC, 2009.
[43] E. Ipek et al. Dynamically replicated memory: building resilient systems from
unreliable nanoscale memories. ASPLOS, 2010.
[44] A. Sampson et al. Approximate storage in solid-state memories. Proc. MICRO,
2013.
[45] J. Wang et al. i2WAP: improving non-volatile cache lifetime by reducing interand intra-set write variations. Proc. HPCA, 2013.
[46] A. Calimera et al. Dynamic indexing: leakage/aging co-optimization for
caches. IEEE TCAD, 2014.
[47] D. Papagiannopoulou et al. Flexible data allocation for scratch-pad memories
to reduce NBTI effects. Proc. ISQED, 2013.
[48] M. Namaki-Shoushtari et al. ARGO: aging-aware GPGPU register file
allocation. Proc. CODES+ISSS, 2013.
[49] P. Gupta et al. Underdesigned and opportunistic computing in presence of
hardware variability. IEEE TCAD, 2012.
[50] L. Lai and P. Gupta. Accurate and inexpensive performance monitoring for
variability-aware systems. Proc. ASP-DAC, 2014.
[51] P. Singh et al. Dynamic NBTI management using a 45nm multi-degradation
sensor. Proc. CICC, 2010.
[52] H. Cho et al. Quantitative evaluation of soft error injection techniques for
robust system design. Proc. DAC, 2013.
[53] L. Wanner et al. VarEMU: an emulation testbed for variability-aware software.
Proc. CODES+ISSS, 2013.
[54] L. Wanner et al. Hardware variability-aware duty cycling for embedded
sensors. IEEE TVLSI, 2012.
[55] L. Bathen et al. VaMV: variability-aware memory virtualization. Proc. DATE,
2012.
[56] A. Rahimi et al. Aging-aware compiler-directed VLIW assignment for
GPGPU architectures. Proc. DAC, 2013.
[57] M. Fojtik et al. Bubble razor: an architecture-independent approach to timingerror detection and correction. Proc. ISSCC, 2012.
[58] J. Sartori et al. Stochastic computing: embracing errors in architecture and
design of processors and applications. Proc. CASES, 2011.
[59] A. BanaiyanMofrad et al. REMEDIATE: a scalable fault-tolerant architecture
for low-power NUCA cache in tiled CMPs. Proc. IGCC, 2013.
[60] M. Sauer et al. Early-life failure detection using SAT-based ATPG. Proc. ITC,
2013.
[61] R. Gabrys et al. Tackling intracell variability in TLC dlash through tensor
product codes. Proc. ISIT, 2012.
[62] L. M. Grupp et al. The harey tortoise: managing heterogeneous write
performance in SSDs. Proc. USENIX Ann. Tech. Conf., 2013.
[63] M. Gottscho et al. Power variability in contemporary DRAMs. IEEE ESL,
2012.
[64] L. Bathen et al. ViPZonE: OS-level memory variability-aware physical address
zoning for energy savings. Proc. CODES+ISSS, 2012.
[65] L. Grupp et al. Characterizing flash memory: anomalies, observations, and
applications. Proc. MICRO, 2009.
[66] L. Bathen and N. Dutt. HaVOC: a hybrid memory-aware virtualization layer
for on-chip distributed scratchpad and nonvolatile memories. Proc. DAC,
2012.
[67] S. Novack et al. A simple mechanism for improving the accuracy and
efficiency of instruction-level disambiguation. Proc. LCPC, 1995
[68] V. Chandra. Monitoring reliability in embedded processors – A multi-layer
view. Proc. DAC, 2014.
[69] J. Henkel et al. Multi-layer dependability: from microarchitecture to
application level. Proc. DAC, 2014.
[70] V. B. Kleeberger et al. workload- and instruction-aware timing analysis – the
missing link between technology and system-level resilience. Proc. DAC, 2014