

DAWG: A Defense Against Cache Timing Attacks in Speculative Execution Processors∗

Vladimir Kiriansky†, Ilia Lebedev†, Saman Amarasinghe†, Srinivas Devadas†, Joel Emer‡

† MIT CSAIL, ‡ NVIDIA / MIT CSAIL
{vlk, ilebedev, saman, devadas, emer}@csail.mit.edu

Abstract—Software side channel attacks have become a serious concern with the recent rash of attacks on speculative processor architectures. Most attacks that have been demonstrated exploit the cache tag state as their exfiltration channel. While many existing defense mechanisms that can be implemented solely in software have been proposed, these mechanisms appear to patch specific attacks, and can be circumvented. In this paper, we propose minimal modifications to hardware to defend against a broad class of attacks, including those based on speculation, with the goal of eliminating the entire attack surface associated with the cache state covert channel.

We propose DAWG, Dynamically Allocated Way Guard, a generic mechanism for secure way partitioning of set associative structures including memory caches. DAWG endows a set associative structure with a notion of protection domains to provide strong isolation. When applied to a cache, unlike existing quality of service mechanisms such as Intel's Cache Allocation Technology (CAT), DAWG fully isolates hits, misses, and metadata updates across protection domains. We describe how DAWG can be implemented on a processor with minimal modifications to modern operating systems. We describe a non-interference property that is orthogonal to speculative execution and therefore argue that existing attacks such as Spectre Variant 1 and 2 will not work on a system equipped with DAWG. Finally, we evaluate the performance impact of DAWG on the cache subsystem.

∗ Student and faculty authors listed in alphabetical order.

Fig. 1. Attack Schema: an adversary 1) accesses a victim's secret, 2) transmits it via a covert channel, and 3) receives it in their own protection domain. [Figure: in the victim's protection domain, secret access code feeds transmitter code (together, the data tap); a receiver in the attacker's protection domain listens on the covert channel. The data tap may be attacker-provided, synthesized by the attacker from existing victim code, or may pre-exist in the victim.]

I. INTRODUCTION

For decades, processors have been architected for performance or power-performance. While it was generally assumed by computer architects that performance and security are orthogonal concerns, there are a slew of examples, including the recent Google Project Zero attacks [22] (Spectre [31] and Meltdown [35]) and variants [30], that show that performance and security are not independent, and micro-architectural optimizations that preserve architectural correctness can affect the security of the system.

In security attacks, the objective of the attacker is to create some software that can steal some secret that another piece of code, the victim, should have exclusive access to. The access to the secret may be made directly, e.g., by reading the value of a memory location, or indirectly, e.g., inferred from the execution flow a program takes. In either case, this leakage of information is referred to as violating isolation, which is different from violating integrity (corrupting the results obtained through program execution).¹

¹ Violating isolation and obtaining a secret may result in the attacker being able to violate integrity as well, since it may now have the capability to modify memory, but in this paper we focus on the initial attack that would violate isolation.

In a well-designed system the attacker cannot architecturally observe this secret, as the secret should be confined to a protection domain that prevents other programs from observing it architecturally. However, vulnerabilities may exist when an attacker can observe side effects of execution via software means.

The mechanisms by which such observations are made are referred to as software side channels. Such channels must be modulated, i.e., their state changed, as a function of activity in the victim's protection domain, and the attacker must be able to detect those state changes. Currently, the most widely explored channel is based on the state of a shared cache. For example, if the attacker observes a hit on an address, the address must be cached already, meaning some party, maybe the victim, had recently accessed it, and it had not yet been displaced. Determining if an access is a hit can be accomplished by measuring the time it takes for a program to make specific references.

A covert communication channel transfers information between processes that should not be allowed to communicate by existing protection mechanisms. For example, when a side channel is used to convey a "secret" to an attacker, an attack would include code inside the victim's protection domain for accessing the secret and a transmitter for conveying the secret to the attacker. Together they form a data tap that will modulate the channel based on the secret. A receiver controlled by the attacker, and outside the victim's protection domain, will listen for a signal on the channel and decode it to determine the secret. This is pictorially illustrated in Fig. 1.
A classic attack on RSA relied on such a scenario [9]. Specifically, existing RSA code followed a conditional execution sequence that was a function of the secret, and inadvertently transmitted private information by modifying instruction cache state in accord with that execution sequence. This resulted in a covert communication that let an observing adversary determine bits of the secret. In this case, the code that accessed the secret and the transmitter that conveyed the secret were pre-existing in the RSA code. Thus, an attacker that shared the icache needed only provide a receiver that could demodulate the secret conveyed over the cache tag state-based channel. Recent work has shown that a broad space of viable attacks exfiltrate information via shared caches.

A. Generalized Attack Schema

Recently, multiple security researchers (e.g., [22], [31], [35]) have found ways for an attacker to create a new data tap in the victim. Here, an attacker is able to create a data tap in the victim's domain and/or influence the data tap to access and transmit a chosen secret. Spectre and Meltdown have exploited the fact that code executing speculatively has full access to any secret.

While speculative execution is broadly defined, we focus on control flow speculation in this paper. Modern processors execute instructions out of order, allowing downstream instructions to execute prior to upstream instructions as long as dependencies are preserved. Most instructions on modern out-of-order processors are also speculative, i.e., they create checkpoints and execute along a predicted path while one or more prior conditional branches are pending resolution. A prediction resolved to be correct discards a checkpoint state, while an incorrect one forces the processor to roll back to the checkpoint and resume along the correct path. Incorrectly predicted instructions are executed, for a time, but do not modify architectural state. However, micro-architectural state such as cache tag state is modified as a result of (incorrect) speculative execution, causing a channel to be modulated, which may allow secrets to leak.
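The speculative data-tap pattern is often illustrated with a Spectre Variant 1-style gadget. The following sketch is hypothetical victim code, not code from this paper; the array names and the 512-byte stride are illustrative. A mis-trained bounds check lets a speculative out-of-bounds read modulate the cache:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical victim code illustrating the Spectre Variant 1 data-tap
 * pattern. Architecturally, an out-of-bounds x never dereferences
 * array1; but a mis-trained branch predictor can run the body
 * speculatively, and the load from probe[] then leaves a cache-tag
 * trace indexed by the secret byte. */
#define ARRAY1_SIZE 16
uint8_t array1[ARRAY1_SIZE];
uint8_t probe[256 * 512];   /* receiver later times loads from probe[] */

uint8_t victim_tap(size_t x) {
    if (x < ARRAY1_SIZE) {              /* bounds check: trained to predict taken */
        return probe[array1[x] * 512];  /* transmitter: secret-dependent cache fill */
    }
    return 0;                           /* architectural result for out-of-bounds x */
}
```

Architecturally the function is benign; the leak exists only micro-architecturally, during the window between mis-prediction and rollback.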
By exploiting mis-speculated execution, an attacker can exercise code paths that are normally not reachable, circumventing software invariants. One example has the attacker speculatively executing data tap code that illegally accesses the secret and causes a transmission via micro-architectural side effects before an exception is raised [35]. Another example has the attacker coercing branch predictor state to encourage mis-speculation along an attacker-selected code path, which implements a data tap in the victim's domain. There are therefore three ways of creating the data tap:
1) Data tap pre-exists in victim's code, as we described in the RSA attack [9].
2) Attacker explicitly programs the data tap. Meltdown [35] is an example of this.
3) Attacker synthesizes a data tap out of existing code in the victim — exemplified by Spectre variants [22], [30], [31].
This framework can be applied to side channels other than the cache state, describing exfiltration via branch predictor logic or TLB state, for example. Given the intensified research interest in variants of this new attack class, we also imagine that there will be new ways that data taps can be constructed. We therefore wish to design a defense against a broad class of current and future attacks.

B. Our approach to defense

Defense mechanisms that can be implemented solely in software have been proposed (e.g., [11], [43]). Unfortunately, these mechanisms appear very attack specific: e.g., a compiler analysis [43] identifies some instances of code vulnerable to Spectre Variant 1; microcode updates or compiler and linker fixes reduce exposure to Spectre Variant 2 [11]. Instructions to turn off speculation in vulnerable regions have been introduced (e.g., [2]) for future compilers to use. In this paper, we target minimal modifications to hardware that defend against a broad class of side channel attacks, including those based on speculation, with the goal of eliminating the entire attack surface associated with exfiltration via changing cache state.

To prevent exfiltration, we require strong isolation between protection domains, which prevents any transmitter/receiver pair from sharing the same channel. Cache partitioning is an appealing mechanism to achieve isolation. Unfortunately, set (e.g., page coloring [29], [50]) and way (e.g., Intel's Cache Allocation Technology (CAT) [21], [23]) partitioning mechanisms available in today's processors are either low-performing or do not provide isolation.

We propose DAWG, Dynamically Allocated Way Guard, a generic mechanism for secure way partitioning of set associative structures including caches. DAWG endows a set associative structure with a notion of protection domains to provide strong isolation. Unlike existing mechanisms such as CAT, DAWG disallows hits across protection domains. This affects hit paths and cache coherence [42], and DAWG handles these issues with minimal modification to modern operating systems, while reducing the attack surface of operating systems to a small set of annotated sections where data moves across protection domains, or where domains are resized/reallocated. Only in this handful of routines is DAWG protection relaxed, and other defensive mechanisms such as speculation fences are applied as needed. We evaluate the performance implications of DAWG using a combination of architectural simulation and real hardware, and compare to conventional and quality-of-service partitioned caches. We conclude that DAWG provides strong isolation with reasonable performance overhead.

C. Contributions and organization

The contributions of our paper are:
1) We motivate strong isolation of replacement metadata by demonstrating that the replacement policy can leak information (cf. Section II-B2) in a way-partitioned cache.
2) We design a cache way partitioning scheme, DAWG, with strong isolation properties that blocks old and new attacks based on the cache state exfiltration channel (cf. Section III). DAWG does not require invasive changes to modern operating systems, and preserves the semantics of copy-on-write resource management.
Fig. 2. Leakage via a shared cache set, implemented via a shared tag TA directly, or indirectly via TX, TY ≅ TA. [Figure: the receiver either evicts TA directly (CLFLUSH(A)) or displaces it with conflicting loads (LOAD(X); LOAD(Y)); the transmitter then accesses TA (or doesn't); finally the receiver times a probe of TA (or of TX): a small time indicates TA ∈ SA, a large time indicates TA ∉ SA.]

Fig. 3. Covert channel via shared replacement metadata, exemplified by a 4-way set-associative cache with a Tree-PLRU policy and a cache allocation boundary; the transmitter is allocated a subset of the cache ways. Tags TA ≅ TB ≅ TC ≅ TD ≅ TX. [Figure: the receiver sets up the PLRU metadata (LOAD(A); LOAD(B)); the transmitter accesses TA (or doesn't); the receiver then observes the PLRU metadata via an eviction (LOAD(D)) and a timed probe of TX, which reveals which way was selected as the next victim.]

3) We analyze the security of DAWG and argue its security against recent attacks that exploit speculative execution and cache-based channels (cf. Section V-A).
4) We illustrate the limitations of cache partitioning for isolation by discussing a hypothetical leak framed by our attack schema (cf. Fig. 1) that circumvents a partitioned cache. For completeness, we briefly describe a defense against this type of attack (cf. Section V-C).
5) We evaluate the performance impact of DAWG in comparison to CAT [21] and non-partitioned caches with a variety of workloads, detailing the overhead of DAWG's protection domains, which limit data sharing in the system (cf. Section VI).

The paper is organized as follows. We provide background and discuss related work in Section II. The hardware modifications implied by DAWG are presented in Section III, and software support is detailed in Section IV. Security analysis and evaluation are the subjects of Section V and Section VI, respectively. Section VII concludes.

II. BACKGROUND AND RELATED WORK

We focus on thwarting attacks by disrupting the channel between the victim's domain and the attacker for attacks that use cache state-based channels. We state our threat model in Section II-A, describe relevant attacks in Section II-B, and existing defenses in Section II-C.

A. Threat model

Our focus is on blocking attacks that utilize the cache state exfiltration channel. We do not claim to disrupt other channels, such as L3 cache slice contention, L2 cache bank contention, network-on-chip or DRAM bandwidth contention, branch data structures, TLBs, or shared functional units in a physical core. In the case of the branch data structures, TLBs, or any other set associative structure, however, we believe that a DAWG-like technique can be used to block the channel associated with the state of those structures. We assume an unprivileged attacker. The victim's domain can be privileged (kernel) code or an unprivileged process.

B. Attacks

The most common channel modulation strategy corresponds to the attacker presetting the cache tag state to a particular value, and then, after the victim runs, observing a difference in the cache tag state to learn something about the victim process. A less common yet viable strategy corresponds to observing changes in coherence [59] or replacement metadata.

1) Cache tag state based attacks: Attacks using cache tag state-based channels are known to retrieve cryptographic keys from a growing body of cryptographic implementations: AES [7], [40], RSA [9], Diffie-Hellman [32], and elliptic-curve cryptography [8], to name a few. Such attacks can be mounted by unprivileged software sharing a computer with the victim software [3]. While early attacks required access to the victim's CPU core, more recent sophisticated channel modulation schemes such as flush+reload [60] and variants of prime+probe [38] target the last-level cache (LLC), which is shared by all cores in a socket. The evict+reload variant of flush+reload uses cache contention rather than flushing [38]. An attack in JavaScript that used a cache state-based channel was demonstrated [39] to automatically exfiltrate private information upon a web page visit.

These attacks use channels at various levels of the memory cache hierarchy and exploit cache lines shared between an attacker's program and the victim process. Regardless of the specific mechanism for inspecting shared tags, the underlying concepts are the same: two entities separated by a trust boundary share a channel based on shared computer system resources, specifically sets in the memory hierarchy. Thus, the entities can communicate (transmitting unwittingly, in the case of an attack) on that cross-trust-boundary channel by modulating the presence of a cache tag in a set. The receiver can detect the transmitter's fills of tag TA either directly, by observing whether it had fetched a shared line, or indirectly, by observing conflict misses on the receiver's own data caused by the transmitter's accesses, as shown in Fig. 2.
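The indirect, conflict-miss detection of Fig. 2 can be modeled with a toy single-set simulator (an illustrative sketch, not the paper's hardware): the receiver primes the set with its own tags, and a later miss on a probe reveals that the transmitter displaced one of them:

```c
/* Toy model of one 2-way cache set with LRU fill, illustrating the
 * indirect (prime+probe style) detection strategy of Fig. 2. */
#include <stdbool.h>
#include <string.h>

#define WAYS 2
typedef struct {
    int tag[WAYS];      /* -1 = invalid */
    int lru;            /* index of the least-recently-used way */
} set_t;

static void set_init(set_t *s) { memset(s->tag, -1, sizeof s->tag); s->lru = 0; }

/* Returns true on a hit; on a miss, fills the LRU way (an eviction). */
static bool lookup(set_t *s, int tag) {
    for (int w = 0; w < WAYS; w++)
        if (s->tag[w] == tag) { s->lru = 1 - w; return true; }
    s->tag[s->lru] = tag;
    s->lru = 1 - s->lru;
    return false;
}
```

In an attack the receiver's `lookup` of its own tag would be the timed probe: a hit (fast) means the transmitter stayed quiet; a miss (slow) means the transmitter's fill evicted the receiver's line.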
2) A cache metadata-based channel: Even without shared cache lines (as is the case in a way-partitioned cache), the replacement metadata associated with each set may be used as a channel. Most replacement policies employ a replacement state bit vector that encodes access history of the cache set in order to predict the ways least costly to evict in case of a miss. If the cache does not explicitly partition the replacement state metadata across protection domains, some policies may violate isolation in the cache by allowing one protection domain's accesses to affect victim selection in another partition. Fig. 3 exemplifies this with Tree-PLRU replacement (Section III-J1): a metadata update after an access to a small partition overwrites metadata bits used to select the victim in a larger partition. A securely way-partitioned cache must ensure that replacement metadata does not allow information flow across the cache partition(s). This means defenses against cache channel-based attacks have to take into account the cache replacement policy and potentially modify the policy to disrupt the channel and hence ensure isolation.
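The Fig. 3 leak can be reproduced with a minimal Tree-PLRU model (an illustrative sketch; the masked victim selection below stands in for CAT-style fill restriction operating on shared, unpartitioned metadata): the transmitter's access to its own way flips tree bits that redirect the receiver's next victim choice:

```c
/* Minimal 4-way Tree-PLRU model (illustrative, not the paper's RTL).
 * b[0] is the root choosing between halves {0,1} and {2,3}; b[1] and
 * b[2] choose within each half. A bit of 0 points left, 1 points
 * right; the victim is the way the bits point to, and an access flips
 * the bits along its path to point away from it. */
#include <stdbool.h>

typedef struct { bool b[3]; } plru_t;

static int plru_victim(const plru_t *p) {
    if (!p->b[0]) return p->b[1] ? 1 : 0;
    else          return p->b[2] ? 3 : 2;
}

static void plru_access(plru_t *p, int way) {
    p->b[0] = (way < 2);                 /* point root away from accessed half */
    if (way < 2) p->b[1] = !(way & 1);   /* point away within the left half   */
    else         p->b[2] = !(way & 1);   /* point away within the right half  */
}

/* CAT-style victim selection: the victim must lie in `mask` (one bit
 * per way), but the tree bits themselves are shared across domains. */
static int plru_victim_masked(const plru_t *p, unsigned mask) {
    bool left_ok = mask & 0x3, right_ok = mask & 0xC;
    bool go_right = (left_ok && right_ok) ? p->b[0] : right_ok;
    if (!go_right) {
        bool ok0 = mask & 1, ok1 = mask & 2;
        return ((ok0 && ok1) ? p->b[1] : ok1) ? 1 : 0;
    } else {
        bool ok2 = mask & 4, ok3 = mask & 8;
        return ((ok2 && ok3) ? p->b[2] : ok3) ? 3 : 2;
    }
}
```

With the receiver allocated ways {0,1,2} and the transmitter way {3}, a single transmitter access flips the shared root bit and changes which of the receiver's own lines is evicted next, which the receiver can observe with a timed probe.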
C. Defenses

Broadly speaking, there are five classes of defenses, with each class corresponding to blocking one of the steps of the attack described in Fig. 1:
1) Prevent access to the secret. For example, KAISER [13], which removes virtual address mappings of kernel memory when executing in user mode, is effective against Meltdown [35].
2) Make it difficult to construct the data tap. For example, randomizing virtual addresses of code, or flushing the Branch Target Buffer (BTB) when entering the victim's domain [46].
3) Make it difficult to launch the data tap. For example, not speculatively executing through permission checks, keeping predictor state partitioned between domains, and preventing user arguments from influencing code with access to secrets. The Retpoline [53] defense against Spectre Variant 2 [11] makes it hard to launch (or construct) a data tap via an indirect branch.
4) Reduce the bandwidth of side channels. For example, removing the APIs for high-resolution timestamps in JavaScript, as well as support for shared memory buffers, to prevent attackers from creating timers.
5) Close the side channels. Prevent the attacker and victim from having access to the same channel, for example, by partitioning cache state or predictor state.
The latter is the strategy of choice in our paper, and we consider three subclasses of prior approaches:

1) Set partitioning via page coloring: Set partitioning, i.e., not allowing occupancy of any cache set by data from different protection domains, can disrupt cache state-based channels. It has the advantage of working with existing hardware when allocating groups of sets at page granularity [34], [61] via page coloring [29], [50]. Linux currently does not support page coloring, since most early OS coloring was driven by the needs of low-associativity data caches [51]. Set partitioning allows communication between protection domains without destroying cache coherence. The downsides are that it requires some privileged entity, or collaboration, to move large regions of data around in memory when allocating cache sets, as set partitioning via page coloring binds cache set allocation to physical address allocation. For example, in order to give a protection domain 1/8 of the cache space, the same 12.5% of the system's physical address space must be given to the process. In an ideal situation, the amount of allocated DRAM and the amount of allocated cache space should be decoupled.

Furthermore, cache coloring at page granularity is not straightforwardly compatible with large pages, drastically reducing the TLB reach, and therefore performance, of processes. On current processors, the placement of the index bits requires that small (4KB) pages are used, and coloring is not possible for large (2MB) pages. Large pages provide critical performance benefits for virtualization platforms used in the public cloud [44], and reverting to small pages would be deleterious.
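To make the index-bit constraint concrete, the following sketch derives page colors for an illustrative LLC slice (2048 sets of 64B lines; these parameters are chosen for the example, not taken from the paper): with 4KB pages several set-index bits lie above the page offset and can be colored by the OS, while a 2MB page offset swallows them all:

```c
/* Illustrative parameters: 2048 sets of 64B lines put the set-index
 * bits at phys[16:6]. A 4KB page offset covers phys[11:0], leaving
 * phys[16:12], i.e., 5 color bits (32 colors), under OS control. A 2MB
 * page offset covers phys[20:0], swallowing every index bit, so only
 * one "color" remains and coloring is impossible. */
#include <stdint.h>

#define LINE_BITS 6     /* 64B cache lines */
#define SET_BITS 11     /* 2048 sets       */

static unsigned colors(unsigned page_bits) {
    int color_bits = LINE_BITS + SET_BITS - page_bits;
    return color_bits > 0 ? 1u << color_bits : 1;   /* 1 color = no choice */
}

static unsigned page_color(uint64_t paddr, unsigned page_bits) {
    return (unsigned)((paddr >> page_bits) & (colors(page_bits) - 1));
}
```

On real Intel LLCs the slice hash complicates this picture further, but the page-offset argument is the same.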
2) Insecure way and fine-grain partitioning: Intel's Cache Allocation Technology (CAT) [21], [23] provides a mechanism to configure each logical process with a class of service, and allocates LLC cache ways to logical processes. The CAT manual explicitly states that a cache access will hit if the line is cached in any of the cache's ways — this allows attackers to observe accesses of the victim. CAT only guarantees that a domain's fill will not cause evictions in another domain. To achieve CAT's properties, no critical-path changes in the cache are required: CAT's behavior on a cache hit is identical to a generic cache. Victim selection (replacement policy), however, must be made aware of the CAT configuration in order to constrain ways on an eviction.

Via this quality of service (QoS) mechanism, CAT improves system performance because an inefficient, cache-hungry process can be reined in and made to only cause evictions in a subset of the LLC, instead of thrashing the entire cache. The fact that the cache checks all ways for cache hits is also good for performance: shared data need not be duplicated, and overhead due to internal fragmentation of cache ways is reduced. The number of ways for each domain can also be dynamically adjusted. For example, DynaWay [16] uses CAT with online performance monitoring to adjust the ways per domain.

CAT-style partitioning is unfortunately insufficient for blocking all cache state-based channels: an attacker sharing a page with the victim may observe the victim's use of shared addresses (by measuring whether a load to a shared address results in a cache hit). Furthermore, even though domains can fill only in their own ways, an attacker is free to flush shared cache lines regardless of where they are cached, allowing straightforward transmission to an attacker's receiver via flush+reload or flush+flush [20]. CAT-style partitioning allows an attacker to spy on lines cached in ways allocated to the victim, so long as the address of a transmitting line is mapped by the attacker.
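The difference between CAT's and DAWG's hit paths can be captured in a few lines (an illustrative model, not the paper's hardware; the one-bit-per-way hitmap encoding follows the DAWG description in Section III): CAT compares tags in every way, so an attacker's probe of a shared line hits wherever the victim cached it, while DAWG masks the comparison with the requester's policy:

```c
/* Toy model contrasting CAT-style and DAWG-style hit checks on one
 * 4-way set holding a line shared by two domains. */
#include <stdbool.h>

#define WAYS 4
typedef struct { int tag[WAYS]; } set_t;   /* -1 = invalid */

/* CAT: a hit may occur in ANY way, regardless of way allocation. */
static bool cat_hit(const set_t *s, int tag) {
    for (int w = 0; w < WAYS; w++)
        if (s->tag[w] == tag) return true;
    return false;
}

/* DAWG: a hit is recognized only in ways enabled by the requesting
 * domain's policy_hitmap (one bit per way). */
static bool dawg_hit(const set_t *s, int tag, unsigned policy_hitmap) {
    for (int w = 0; w < WAYS; w++)
        if (((policy_hitmap >> w) & 1) && s->tag[w] == tag) return true;
    return false;
}
```

With the victim allocated ways {0,1} (hitmap 0x3) and the attacker ways {2,3} (hitmap 0xC), a victim-cached shared line is visible to the attacker's probe under CAT but not under DAWG.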
This is especially problematic when considering Spectre-style attacks, as the victim (OpenSSL, kernel, etc.) can be made to speculatively touch arbitrary addresses, including those in shared pages. In a more subtle channel, access patterns leak through metadata updates on hitting loads, as the replacement metadata is shared across protection domains.

Applying DAWG domain isolation to fine-grain QoS partitioning such as Vantage [47] would further improve scalability to high core counts. Securing Vantage is similar to securing CAT: hits can be isolated, since each cache tag is associated with a partition ID; replacement metadata (timestamps or RRIP [26]) should be restricted to each partition; additionally, Vantage misses allow interference, and demotion to the unmanaged 10% of the cache, which must be secured.

3) Reducing privacy leakage from caches: Since Spectre attacks are outside of the threat model anticipated by prior work, most prior defenses are ineffective. LLC defenses against cross-core attacks, such as SHARP [58] and RIC [28], do not stop same-core OS/VMM attacks. In addition, RIC's non-inclusive read-only caches do not stop speculative attacks from leaking through read-write cache lines in cache coherence attacks [52]. PLcache [33], [56] and the Random Fill Cache Architecture (RFill, [37]) were designed and analyzed in the context of a small region of sensitive data. RPcache [33], [56] trusts the OS to assign different hardware process IDs to mutually mistrusting entities, and its mechanism does not directly scale to large LLCs. The non-monopolizable cache [14] uses a well-principled partitioning scheme, but does not completely block all channels, and relies on the OS to assign hardware process IDs. CATalyst [36] trusts the Xen hypervisor to correctly tame Intel's Cache Allocation Technology into providing cache pinning, which can only secure software whose code and data fits into a fraction of the LLC, e.g., each virtual machine is given 8 "secure" pages. [49] similarly depends on CAT for the KVM (Kernel-based Virtual Machine) hypervisor. Using hardware transactional memory, Cloak [19] preloads secrets into the cache within one transaction to prevent observation of access patterns to secrets. Blocking the channels used by speculative attacks, however, requires all addressable memory to be protected.

SecDCP [55] demonstrates dynamic allocation policies, assuming a secure partitioning mechanism is available; it provides only "one-way protection" for a privileged enclave with no communication. DAWG offers the desired partitioning mechanism; we additionally enable two-way communication between OS and applications, and handle mutually untrusted peers at the same security level. We allow deduplication, shared libraries, and memory mapping, which in prior work must all be disabled.

III. DYNAMICALLY ALLOCATED WAY GUARD (DAWG) HARDWARE

The objective of DAWG is to preclude the existence of any cache state-based channels between the attacker's and victim's domains. It accomplishes this by isolating the visibility of any state changes to a single protection domain, so any transmitter in the victim's domain cannot be connected to the same channel as any receiver in the attacker's domain. This prevents any communication or leaks of data from the victim to the attacker.

Fig. 4. A Set-Associative Cache structure with DAWG. [Figure: a set-associative cache whose cache and backend ports carry the address, Core ID, domain_id, and coherence metadata to the cache controller state machine; the requesting domain's policy masks the per-way tag comparisons (policy-masked hit isolation) and the way-write enables for fills, while the replacement policy and coherence logic update the per-set metadata. The DAWG additions over a conventional cache are highlighted.]

A. High-level design

Consider a conventional set-associative cache, a structure comprised of several ways, each of which is essentially a direct-mapped cache, as well as a controller mechanism. In order to implement Dynamically Allocated Way Guard (DAWG), we will allocate groups of ways to protection domains, restricting both cache hits and line replacements to the ways allocated to the protection domain from which the cache request was issued. On top of that, the metadata associated with the cache, e.g., replacement policy state, must also be allocated to protection domains in a well-defined way, and securely partitioned. These allocations will force strong isolation between the domains' interactions with one another via the cache structure.

DAWG's protection domains are disjoint across ways and across metadata partitions, except that protection domains may be nested to allow trusted privileged software access to all ways and metadata allocated to the protection domains in its purview.

Fig. 4 shows the hardware structure corresponding to a DAWG cache, with the additional hardware required by DAWG over a conventional set-associative cache shown highlighted. The additional hardware state for each core is 24 bits per hardware thread – one register with three 8-bit active domain selectors. Each cache additionally needs up to 256 bits to describe the allowed hit and fill ways for each active domain (e.g., 16-bit bit vectors for a typical current 16-way cache).

B. DAWG's isolation policies

DAWG's protection domains are a high-level property orchestrated by software, and implemented via a table of policy configurations, used by the cache to enforce DAWG's isolation; these are stored at the DAWG cache in MSRs (model-specific registers). System software can write to these policy MSRs for each domain_id to configure the protection domains enforced by the cache.
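The per-thread state described above, one register with three 8-bit domain selectors (one per access type; cf. Section III-C), might be laid out as follows. The packing is an illustrative encoding; the actual MSR format is an assumption here:

```c
/* Sketch of the per-hardware-thread DAWG MSR: three 8-bit domain_id
 * selectors, one for each access type (ifetch, load, store), packed
 * into 24 bits of a register. Field layout is illustrative. */
#include <stdint.h>

enum access_type { IFETCH = 0, LOAD = 1, STORE = 2 };  /* CS, DS, ES selectors */

static uint32_t make_msr(uint8_t cs, uint8_t ds, uint8_t es) {
    return (uint32_t)cs | ((uint32_t)ds << 8) | ((uint32_t)es << 16);
}

static uint8_t selector(uint32_t msr, enum access_type t) {
    return (uint8_t)(msr >> (8 * t));   /* extract the 8-bit domain_id */
}
```

Normally all three selectors name the same domain; during cross-domain communication (e.g., servicing a system call) system software can point loads and stores at different domains.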
Each access to a conventional cache structure is accompanied by request metadata, such as a Core ID, as in Fig. 4. DAWG extends this metadata to reference a policy specifying the protection domain (domain_id) as context for the cache access. For a last-level memory cache, the domain_id field is required to allow system software to propagate the domain on whose behalf the access occurs, much like a capability. The hardware needed to endow each cache access with the appropriate domain_id is described in Section III-C.

Each policy consists of a pair of bit fields, all accessible via the DAWG cache's MSRs:
• A policy_fillmap: a bit vector masking fills and victim selection, as described in Sections III-D and III-E.
• A policy_hitmap: a bit vector masking way hits in the DAWG cache, as described in Section III-F.

Each DAWG cache stores a table of these policy configurations, managed by system software, and selected by the cache request metadata at each cache access. Specifically, this table maps global domain_id identifiers to that domain's policy configuration in a given DAWG cache. We discuss the software primitives to manage protection domains, i.e., to create, modify, and destroy way allocations for protection domains, and to associate processes with protection domains, in Section IV-A1.
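Putting the policy table and fill masking together, miss handling can be sketched as follows (an illustrative model: the explicit recency order stands in for the partitioned replacement policies of Section III-J, and the table sizing assumes the 8-bit domain_id and 16-way masks described in the text):

```c
/* Sketch of a DAWG policy table and miss handling: each domain_id
 * selects a policy, and on a miss the victim way is chosen only among
 * ways enabled in that domain's policy_fillmap. */
#include <stdint.h>

#define WAYS 16
typedef struct { uint16_t policy_hitmap, policy_fillmap; } dawg_policy_t;

static dawg_policy_t policy_table[256];     /* indexed by 8-bit domain_id */

/* Choose a victim way for a miss by `domain_id`: scan a (stand-in)
 * least-recently-used ordering and take the first way the domain may
 * fill, so an eviction never lands outside the domain's ways. */
static int select_victim(uint8_t domain_id, const int recency_order[WAYS]) {
    uint16_t fillmap = policy_table[domain_id].policy_fillmap;
    for (int i = 0; i < WAYS; i++) {
        int w = recency_order[i];
        if ((fillmap >> w) & 1) return w;
    }
    return -1;                              /* misconfigured: no ways allocated */
}
```

For example, with domain 1 allocated ways 0-7 and domain 2 allocated ways 8-15, each domain's misses evict only within its own half, regardless of global recency.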
Fig. 4.
C. DAWG’s modifications to processor cores
Each (logical) core must also correctly tag its memory E. DAWG’s cache metadata isolation
accesses with the correct domain_id. To this end, we endow The cache set metadata structure in Fig. 4 stores per-line
each hardware thread (logical core) with an MSR specifying helper data including replacement policy and cache coherence
the domain_id fields for each of the three types of accesses state. The metadata update logic uses tag comparisons (hit
recognized by DAWG: instruction fetches via the instruction information) from all ways to modify set replacement state.
cache, read-only accesses (loads, flushes, etc), and modifying DAWG does not leak via the coherence metadata, as coherence
accesses (anything that can cause a cache line to enter the traffic is tagged with the requestors’s protection domain and
modified state, e.g., stores or atomic accesses). We will refer does not modify lines in other domains (with a sole exception
to these three types of accesses as ifetches, loads, and stores; described in Section III-G).
(anachronistically, we name the respective domain selectors DAWG’s replacement metadata isolation requirement, at a
CS, DS, and ES). Normally, all three types of accesses are high level, is a non-interference property: victim selection in
associated with the same protection domain, but this is not the a protection domain should not be affected by the accesses
case during OS handling of memory during communication performed against any other protection domain(s). Furthermore,
across domains (for example when servicing a system call). the cache’s replacement policy must allow system software to
The categorization of accesses is important to allow system sanitize the replacement data of a way in order to implement
software to implement message passing, and the indirection safe protection domain resizing. Details of implementing
through domain selectors allows domain resizing, as described DAWG-friendly partitionable cache replacement policies are
in Section IV. explored in Section III-J.
The bit width of the domain_id identifier caps the
number of protection domains that can be simultaneously F. DAWG’s cache hit isolation
scheduled to execute across the system. In practice, a single Cache hits in a DAWG cache must also be isolated, requiring
bit (differentiating kernel and user-mode accesses) is a useful a change to the critical path of the cache structure: a cache
minimum, and a reasonable maximum is the number of sockets access must not hit in ways it was not allocated – a possibility
multiplied by the largest number of ways implemented by any if physical tags are shared across protection domains.
DAWG cache in the system (e.g., 16 or 20). An 8-bit identifier Consider a read access with address A =⇒ (TA , SA ) (tag
is sufficient to enumerate the maximum active domains even and set, respectively) in a conventional set associative cache.
across 8-sockets with 20-way caches. A match on any of the way comparisons indicates a cache
Importantly, MSR writes to each core’s domain_id, hit (∃ i | TWi == TA =⇒ hit); the associated cache line
and each DAWG cache’s policy_hitmap and data is returned to the requesting core, and the replacement
policy metadata is updated to make note of the access. This requires a new privileged MSR, with which to invalidate
allows a receiver (attacker) to communicate via the cache state all copies of a Shared line, given an address, regardless of
by probing the cache tag or metadata state as described in protection domain. DAWG relies on system software to prevent
Section II-B. the case of a replicated Modified line.
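In C-like pseudocode, the conventional lookup just described amounts to an unmasked tag comparison across all ways (a sketch with illustrative types; valid bits are elided, and real hardware performs the comparisons in parallel rather than in a loop):

```c
#include <stdbool.h>
#include <stdint.h>

#define NWAYS 8
typedef uint64_t tag_t;

/* Conventional set associative hit check: a match on ANY way's tag is
 * a hit (∃ i | TWi == TA ⟹ hit), no matter which protection domain
 * filled the line. */
bool conventional_hit(const tag_t way_tags[NWAYS], tag_t ta, int *way) {
    for (int i = 0; i < NWAYS; i++) {
        if (way_tags[i] == ta) {
            *way = i;
            return true;
        }
    }
    return false;
}
```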
In DAWG, tag comparisons must be masked with a policy (policy_hitmap) that white-lists ways allocated to the requester's protection domain (∃ i | policy_hitmap[i] & (TWi == TA) ⟹ hit). By configuring policy_hitmap, system software can ensure cache hits are not visible across protection domains. While the additional required hardware in DAWG caches' hit path adds a gate delay to each cache access, we note that modern L1 caches are usually pipelined. We expect hardware designers will be able to manage an additional low-fanout gate without affecting clock frequency.
In addition to masking hits, DAWG's metadata update must use this policy-masked hit information to modify any replacement policy state safely, preventing information leakage across protection domains via the replacement policy state, as described in Section III-E.
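The masked comparison can be sketched in C as follows (illustrative types again; in hardware this is a single extra gate level on the hit path, not a loop):

```c
#include <stdbool.h>
#include <stdint.h>

#define NWAYS 8
typedef uint64_t tag_t;

/* DAWG hit check: each tag match is gated by the requester's
 * policy_hitmap (∃ i | policy_hitmap[i] & (TWi == TA) ⟹ hit), so a
 * copy of the line cached in another domain's ways reads as a miss. */
bool dawg_hit(const tag_t way_tags[NWAYS], uint32_t policy_hitmap,
              tag_t ta, int *way) {
    for (int i = 0; i < NWAYS; i++) {
        if (((policy_hitmap >> i) & 1) && way_tags[i] == ta) {
            *way = i;
            return true;
        }
    }
    return false;
}
```

The same policy-masked hit vector must feed the replacement metadata update, so that an access never perturbs replacement state outside its own ways.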
G. Cache lines shared across domains

DAWG effectively hides cache hits outside the white-listed ways as per policy_hitmap. While this prevents information leakage via adversarial observation of cached lines, it also complicates the case where addresses are shared across two or more protection domains, by allowing ways belonging to different protection domains to have copies of the same line. Read-only data and instruction misses acquire lines in the Shared state of the MESI protocol [42] and its variants.
Neither a conventional set associative cache nor Intel's CAT permits duplicating a cache line within a cache: their hardware enforces a simple invariant that a given tag can exist in only a single way of a cache at any time. In the case of a DAWG cache, the hardware does not strictly enforce this invariant across protection domains; we allow read-only cache lines (in Shared state) to be replicated across ways in different protection domains. Replicating shared cache lines, however, may leak information via the cache coherence protocol (whereby one domain can invalidate lines in another), or violate invariants expected by the cache coherence protocol (by creating a situation where multiple copies of a line exist when one is in the Modified state).
In order to maintain isolation, cache coherence traffic must respect DAWG's protection domain boundaries. Requests on the same line from different domains are therefore considered non-matching, and are filled by the memory controller. Cache flush instructions (CLFLUSH, CLWB) affect only the ways allocated to the requesting domain_id. Cross-socket invalidation requests must likewise communicate their originating protection domain. DAWG caches are not, however, expected to handle a replicated Modified line, meaning system software must not allow shared writable pages across protection domains, via a TLB invariant described in Section IV-B2.
Stale Shared lines of de-allocated pages may linger in the cache; DAWG must invalidate these before zeroing a page to be granted to a process (see Section IV-B2). To this end, DAWG requires a new privileged MSR, with which to invalidate all copies of a Shared line, given an address, regardless of protection domain. DAWG relies on system software to prevent the case of a replicated Modified line.

H. Dynamic allocation of ways

It is unreasonable to implement a static protection domain policy, as it would make inefficient use of the cache resources due to internal fragmentation of ways. Instead, DAWG caches can be provisioned with updated security policies dynamically, as the system's workload changes.
In order to maintain its security properties, system software must manage protection domains by manipulating the domains' policy_hitmap and policy_fillmap MSRs in the DAWG cache. These MSRs are normally equal, but diverge to enable concurrent use of shared caches.
In order to re-assign a DAWG cache way, when creating or modifying the system's protection domains, the way must be invalidated, destroying any private information in the form of the cache tags and metadata for the way(s) in question. In the case of write-back caches, dirty cache lines in the affected ways must be written back, or swapped within the set. A privileged software routine flushes one or more ways via a hardware affordance to perform fine-grained cache flushes by set&way, e.g., available on ARM [1].
We require hardware mechanisms to flush a line and/or perform write-back (if Modified) of a specified way in a DAWG memory cache, allowing privileged software to orchestrate way-flushing as part of its software management of protection domains. This functionality is exposed for each cache, and therefore accommodates systems with diverse hierarchies of DAWG caches. We discuss the software mechanism to accommodate dynamic protection domains in Section IV-A2.
While this manuscript does describe the mechanism to adjust the DAWG policies in order to create, grow, or shrink protection domains, we leave as future work resource management support to securely determine the efficient sizes of protection domains for a given workload.

I. Scalability and cache organization

Scalability of the number of active protection domains is a concern with a growing number of cores per socket. Since performance-critical VMs or containers usually require multiple cores, however, the maximum number of active domains does not have to scale up to the number of cores.
DAWG on non-inclusive LLC caches [25] can also assign zero LLC ways to single-core domains, since these do not need communication via a shared cache. Partitioning must be applied to cache coherence metadata, e.g., snoop filters. Private cache partitioning allows a domain per SMT thread.
On inclusive LLC caches, the number of concurrently active domains is limited by the number of ways — for high-core-count CPUs this may require increasing associativity, e.g., from 20-way to 32-way. Partitioning replacement metadata allows high-associativity caches with just 1 or 2 bits of metadata per tag to accurately select victims and remain secure.
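The privileged way-sanitization routine of Section III-H can be sketched as a software loop over set indexes, using a hypothetical flush-by-set&way primitive modeled here against a toy cache array (a real implementation would invoke the hardware affordance discussed above):

```c
#include <stdbool.h>
#include <stdint.h>

#define NSETS 4
#define NWAYS 8

typedef struct {
    bool     valid, dirty;
    uint64_t tag;
} line_t;

static line_t cache[NSETS][NWAYS];

/* Illustrative stand-in for a hardware "flush by set&way" operation
 * (cf. ARM's set/way cache maintenance): write back if dirty, then
 * invalidate, destroying the tag and metadata of the line. */
static void flush_set_way(int set, int way) {
    line_t *l = &cache[set][way];
    if (l->valid && l->dirty) {
        /* write-back to memory would happen here */
    }
    l->valid = false;
    l->dirty = false;
    l->tag = 0;
}

/* Before a way is granted to a new protection domain, privileged
 * software sanitizes every set index of that way. */
void dawg_flush_way(int way) {
    for (int set = 0; set < NSETS; set++)
        flush_set_way(set, way);
}
```

The ordering of this flush relative to the policy_fillmap and policy_hitmap updates is what makes the transfer safe, as detailed in Section IV-A2.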
Fig. 5. Victim selection and metadata update with a DAWG-partitioned Tree-PLRU policy.

Fig. 6. Victim selection and metadata update with a DAWG-partitioned NRU policy.
J. Replacement policies

In this section, we will exemplify the implementation of several common replacement policies compatible with DAWG's isolation requirement. We focus here on several commonplace replacement policies, given that cache replacement policies are diverse. The optimal policy for a workload depends on the effective associativity and may even be software-selected; e.g., the ARM A72 [1] allows pseudo-random or pseudo-LRU cache replacement.

1) Tree-PLRU: pseudo least recently used: Tree-PLRU "approximates" LRU with a small set of bits stored per cache set. The victim selection is a (complete) decision tree informed by metadata bits: 1 signals "go left", whereas 0 signals "go right" to reach the PLRU element, as shown in Fig. 5.
The cache derives plru_mask and plru_policy from policy_fillmap. These fields augment a decision tree over the ways of the cache; a bit of plru_mask is 0 if and only if its corresponding subtree in policy_fillmap has no zeroes (if the subtree of the decision tree is entirely allocated to the protection domain). Similarly, plru_policy bits are set if their corresponding left subtrees contain one or more ways allocated to the protection domain. For example, if a protection domain is allocated ways W0, W1 of 8 ways, then plru_mask=0b11111110, and plru_policy=0bxxx0x0x (0b0000001, to be precise, with x marking masked and unused bits).
At each access, set_metadata is updated by changing each bit on the branch leading to the hitting way to be the opposite of the direction taken, i.e., "away" from the most recently used way. For example, when accessing W5, metadata bits are updated by b0 → 0, b2 → 1, b5 → 0. These updates are masked to avoid modifying PLRU bits above the allocated subtree. For example, when {W2, W3} are allocated to the process, and it hits W3, b0 and b1 remain unmodified to avoid leaking information via the metadata updates.
Furthermore, we must mask set_metadata bits that are made irrelevant by the allocation. For example, when {W2, W3} are allocated to the process, the victim selection should always reach the b4 node when searching for the pseudo-LRU way. To do this, ignore {b0, b1} in the metadata table, and use values 0 and 1, respectively.
Observe that both are straightforwardly implemented via plru_mask and plru_policy. This forces a subset of decision tree bits, as specified by the policy: victim selection logic uses ((set_metadata & ~plru_mask) | (plru_mask & plru_policy)). This ensures that system software is able to restrict victim selection to a subtree over the cache ways. Metadata updates are also partitioned, by constraining updates to set_metadata & ~plru_mask. When system software alters the cache's policies and re-assigns a way to a different protection domain, it must take care to force the way's metadata to a known value in order to avoid private information leakage.

2) SRRIP and NRU: Not recently used: An NRU policy requires one bit of metadata per way to be stored with each set. On a cache hit, the accessed way's NRU bit is set to "0". On a cache miss, the victim is the first (according to some pre-determined order, such as left-to-right) line with a "1" NRU bit. If none exists, the first line is victimized, and all NRU bits of the set are set to "1".
Enforcing DAWG's isolation across protection domains for an NRU policy is a simple matter, as shown in Fig. 6. As before, metadata updates are restricted to the ways white-listed by nru_mask = policy_fillmap. In order to victimize only among ways white-listed by the policy, the NRU bits of all other ways are masked via set_metadata & nru_mask at the input to the NRU replacement logic.
Instead of victimizing the first cache line if no "1" bits are found, the victim way must fall into the current protection domain. To implement this, the default victim is specified via nru_victim, which selects the leftmost way with a corresponding "1" bit of nru_mask, whereas unmodified NRU is hard-wired to evict a specific way.
The SRRIP [26] replacement policy is similar, but expands the state space of each line from two to four (or more) states by adding a counter to track ways less favored to be victimized. Much like NRU, SRRIP victimizes the first (up to some pre-determined order) line with the largest counter during a fill that requires eviction. To partition SRRIP, the same nru_mask = policy_fillmap is used, where each line's metadata is masked with the way's bit of nru_mask to ensure other domains' lines are considered "recently used" and not candidates for eviction.
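To make the two partitioned policies concrete, the following C sketch implements victim selection for both an 8-way DAWG-partitioned Tree-PLRU and the partitioned NRU. The bit layout is our own illustrative choice, arranged to match the paper's worked example (b0 as the root, left child of node i at 2i+2, 1 = "go left", W7 leftmost); actual hardware would realize this as combinational logic rather than loops:

```c
#include <stdint.h>

/* Victim selection for an 8-way DAWG-partitioned Tree-PLRU.
 * Bits b0..b6 form a decision tree: b0 is the root, b2/b1 cover ways
 * W7-W4 / W3-W0, and b6,b5,b4,b3 are the leaves over {W7,W6},{W5,W4},
 * {W3,W2},{W1,W0}. A bit value of 1 means "go left". */
int plru_victim(uint8_t set_metadata, uint8_t plru_mask, uint8_t plru_policy) {
    /* Masked bits are forced to the policy's value, confining the
     * walk to the subtree allocated to the protection domain:
     * (set_metadata & ~plru_mask) | (plru_mask & plru_policy). */
    uint8_t eff = (uint8_t)((set_metadata & (uint8_t)~plru_mask)
                          | (plru_mask & plru_policy));
    int node = 0;
    while (node < 3)                 /* internal nodes 0..2          */
        node = ((eff >> node) & 1) ? 2 * node + 2 : 2 * node + 1;
    int left_way = 2 * (node - 3) + 1;   /* leaf picks W or W-1      */
    return ((eff >> node) & 1) ? left_way : left_way - 1;
}

/* Victim selection for DAWG-partitioned NRU: the leftmost (W7-first)
 * way whose NRU bit is 1 *within* nru_mask; if none, fall back to
 * nru_victim, the leftmost way permitted by nru_mask. */
int nru_select(uint8_t nru_bits, uint8_t nru_mask) {
    uint8_t in_domain = (uint8_t)(nru_bits & nru_mask);
    for (int w = 7; w >= 0; w--)
        if ((in_domain >> w) & 1) return w;
    for (int w = 7; w >= 0; w--)     /* default nru_victim           */
        if ((nru_mask >> w) & 1) return w;
    return -1;                       /* empty mask: no legal victim  */
}
```

For a domain allocated {W7, W6} under this layout, only leaf b6 is unmasked (plru_mask = 0x3F), and the forced policy bits steer the walk left at b0 and b2 (plru_policy = 0x05), so the victim is always W7 or W6.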
IV. SOFTWARE MODIFICATIONS

We describe software provisions for modifying DAWG's protection domains, and also describe small, required modifications to several well-annotated sections of kernel software to implement cross-domain communication primitives robust against speculative execution attacks.

A. Software management of DAWG policies

Protection domains are a software abstraction implemented by system software via DAWG's policy MSRs. The policy MSRs themselves (a table mapping protection domain_id to a policy_hitmap and policy_fillmap at each cache, as described in Section III-B) reside in the DAWG cache hardware, and are atomically modified.

1) DAWG Resource Allocation: Protection domains for a process tree should be specified using the same cgroup-like interface as Intel's CAT. In order to orchestrate DAWG's protection domains and policies, the operating system must track the mapping of process IDs to protection domains. In a system with 16 ways in the most associative cache, no more than 16 protection domains can be concurrently scheduled, meaning that if the OS needs to schedule more mutually distrusting entities, it must virtualize protection domains by time-multiplexing protection domain IDs, and flushing the ways of the multiplexed domain whenever it is re-allocated.
Another data structure, dawg_policy, tracks the resources (cache ways) allocated to each protection domain. This is a table mapping domain_id to pairs (policy_hitmap, policy_fillmap) for each DAWG cache. The kernel uses this table when resizing, creating, or destroying protection domains in order to maintain an exclusive allocation of ways to each protection domain. Whenever one or more ways are re-allocated, the supervisor must look up the current domain_id of the owner, accomplished via either a search or a persistent inverse map from cache way to domain_id.

2) Secure Dynamic Way Reassignment: When modifying an existing allocation of ways in a DAWG cache (writing policy MSRs), as necessary to create or modify protection domains, system software must sanitize the re-allocated way(s) (including any replacement metadata, as discussed in Section III-E) before they may be granted to a new protection domain. The process for re-assigning cache way(s) proceeds as follows:
1) Update the policy_fillmap MSRs to disallow fills in the way(s) being transferred out of the shrinking domain.
2) A software loop iterates through the cache's set indexes and flushes all sets of the re-allocated way(s). The shrinking domain may hit on lines yet to be flushed, as policy_hitmap is not yet updated.
3) Update the policy_hitmap MSRs to exclude the ways to be removed from the shrinking protection domain.
4) Update the policy_hitmap and policy_fillmap MSRs to grant the ways to the growing protection domain.
Higher-level policies can be built on this dynamic way-reassignment mechanism.

3) Code Prioritization: Programming the domain selectors for code and data separately allows ways to be dedicated to code without data interference. Commercial studies of code cache sensitivity of production server workloads [25], [27], [41] show large instruction miss rates in L2, but even the largest code working sets fit within 1–2 L3 ways. Code prioritization will also reduce the performance impact of disallowing code sharing across domains, especially when context switching between untrusted domains sharing code.

B. Kernel changes required by DAWG

Consider a likely configuration where a user-mode application and the OS kernel are in different protection domains. In order to perform a system call, communication must occur across the protection domains: the supervisor extracts the (possibly cached) data from the caller by copying it into its own memory. In DAWG, this presents a challenge due to strong isolation in the cache.

1) DAWG augments SMAP-annotated sections: We take advantage of modern OS support for the Supervisor Mode Access Prevention (SMAP) feature available in recent x86 architectures, which allows supervisor-mode programs to raise a trap on accesses to user-space memory. The intent is to harden the kernel against malicious programs attempting to influence privileged execution via untrusted user-space memory. At each routine where supervisor code intends to access user-space memory, SMAP must be temporarily disabled and subsequently re-enabled via stac (Set AC Flag) and clac (Clear AC Flag) instructions, respectively. We observe that a modern kernel's interactions with user-space memory are diligently annotated with these instructions, and will refer to these sections as annotated sections.
Currently, Linux kernels use seven such sections for simple memory copy or clearing routines: copy_from_user, copy_to_user, clear_user, futex, etc. We propose extending these annotated sections with short instruction sequences to correctly handle DAWG's communication requirements on system calls and inter-process communication, in addition to the existing handling of the SMAP mechanism. Specifically, sections implementing data movement from user to kernel memory are annotated with an MSR write to domain_id: ifetch and store accesses proceed on behalf of the kernel, as before, but load accesses use the caller's (user) protection domain. This allows the kernel to efficiently copy from warm cache lines, but preserves isolation. After copying from the user, the domain_id MSR is restored to perform all accesses on behalf of the kernel's protection domain. Likewise, sections implementing data movement to user memory ifetch and load on behalf of the kernel's domain, but store in the user's cache ways. While the annotated sections may be interrupted by asynchronous events, interrupt handlers are expected to explicitly set domain_id to the kernel's protection domain, and restore the MSR to its prior state afterwards.
As described in Section III-C, DAWG's domain_id MSR writes are a fence, preventing speculative disabling of DAWG's protection mechanism. Current Linux distributions diligently pair a stac instruction with an lfence instruction to prevent speculative execution within regions that access user-mode memory, meaning DAWG does not significantly serialize annotated sections over its insecure baseline.
Finally, to guarantee isolation, we require the annotated sections to contain only code that obeys certain properties: to protect against known and future speculative attacks, indirect jumps or calls, and potentially unsafe branches are not to be used. Further, we cannot guarantee that these sections will not require patching as new attacks are discovered, although this is reasonable given the small number and size of the annotated sections.
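A minimal sketch of a DAWG-augmented user-to-kernel copy, under the assumptions that dawg_wrmsr_ds() stands in for the real (fencing) MSR write to the load-domain selector DS, and that the stac/clac and lfence of the real annotated section are elided:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the per-core load-domain selector (DS). In
 * hardware this is a privileged MSR write that also acts as a fence. */
static uint64_t msr_ds;
static void dawg_wrmsr_ds(uint64_t domain_id) { msr_ds = domain_id; }

enum { KERNEL_DOMAIN = 0 };

/* Sketch of a DAWG-aware copy_from_user: loads hit the caller's
 * (user) cache ways, while ifetches and stores remain on behalf of
 * the kernel, so the kernel copies from warm user lines without
 * breaking isolation. */
void dawg_copy_from_user(void *dst, const void *src, size_t n,
                         uint64_t user_domain) {
    dawg_wrmsr_ds(user_domain);       /* loads on the user's behalf   */
    const char *s = src;
    char *d = dst;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];                  /* load: user way; store: kernel */
    dawg_wrmsr_ds(KERNEL_DOMAIN);     /* restore the kernel's domain  */
}
```

A copy_to_user counterpart would instead redirect the store-domain selector (ES) to the user's domain while loading on behalf of the kernel.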
2) Read-only and CoW sharing across domains: For memory efficiency, DAWG allows securely mapping read-only pages across protection domains, e.g., for shared libraries, requiring hardware cache coherence protocol changes (see Section III-G) and OS/hypervisor support.
This enables conventional system optimizations via page sharing, such as read-only mmap from page caches, Copy-on-Write (CoW) conventionally used for fork, or page deduplication across VMs (e.g., Transparent Page Sharing [54]; VM page sharing is typically disabled due to concerns raised by shared cache tag attacks [24]). DAWG maintains security with read-only mappings across protection domains to maintain memory efficiency.
Dirty pages can be prepared for CoW sharing eagerly, or lazily (but cautiously [57]) by installing non-present pages in the consumer domain's mapping. Preparing a dirty page for sharing requires a write-back of any dirty cache lines on behalf of the producer's domain (via CLWB instructions and an appropriate load domain_id). The write-back guarantees that read-only pages appear only as Shared lines in DAWG caches, and can be replicated across protection domains as described in Section III-G.
A write to a page read-only shared across protection domains signals the OS to create a new, private copy, using the original producer's domain_id for reads and the consumer's domain_id for writes.
3) Reclamation of shared physical pages: Before cache lines may be filled in a new protection domain, pages reclaimed from a protection domain must be removed from DAWG caches as part of normal OS page cleansing. Prior to zeroing (or preparing for DMA) a page previously shared across protection domains, the OS must invalidate all cache lines belonging to the page, as described in Section III-G. The same is required between unmap and mmap operations over the same physical addresses. For most applications, therefore, cache line invalidation can be deferred to wholesale destruction of protection domains at exit, given ample physical memory.

V. SECURITY ANALYSIS

We explain why DAWG protects against attacks realized thus far on speculative execution processors by stating and arguing a non-interference property in Section V-A. We then argue in Section V-B that system calls and other cross-domain communication are safe. Finally, we show a generalization of our attack schema and point out the limitations of cache partitioning in Section V-C.

A. DAWG Isolation Property

DAWG enforces isolation of exclusive protection domains among cache tags and replacement metadata, as long as:
1) victim selection is restricted to the ways allocated to the protection domain (an invariant maintained by system software), and
2) metadata updates as a result of an access in one domain do not affect victim selection in another domain (a requirement on DAWG's cache replacement policy).
Together, this guarantees non-interference – the hits and misses of a program running in one protection domain are unaffected by program behavior in different protection domains. As a result, DAWG blocks the cache tag and metadata channels of non-communicating processes separated by DAWG's protection domains.

B. No leaks from system calls

Consider a case where the kernel (victim) and a user program (attacker) reside in different protection domains. While both use the same cache hierarchy, they share neither cache lines nor metadata (Section III-B), effectively closing the cache exfiltration channel. In a few, well-defined instances where data is passed between them (such as copy_to_user), the kernel accesses the attacker's ways to read/write user memory, leaking the (public) access pattern associated with the bulk copying of the syscall inputs and outputs (Section IV-B). Writes to DAWG's MSRs are fences, and the annotated sections must not offer an opportunity to maliciously mis-speculate control flow (see Section IV-B1), thwarting speculative disabling or misuse of DAWG's protection domains. DAWG also blocks leaks via coherence metadata, as coherence traffic is restricted to its protection domain (Section III-G), with the sole exception of cross-domain invalidation, where physical pages are reclaimed and sanitized.
When re-allocating cache ways, as part of resizing or multiplexing protection domains, no private information is transferred to the receiving protection domain: the kernel sanitizes ways before they are granted, as described in Section IV-A2. Physical pages re-allocated across protection domains are likewise sanitized (Section IV-B3).
When an application makes a system call, the necessary communication (data copying) between kernel and user program must not leak information beyond what is communicated. The OS's correct handling of the domain_id MSR within annotated sections, as described in Section IV-B, ensures user-space cache side effects reflect the section's explicit memory accesses.

C. Limitations of cache partitioning

DAWG's cache isolation goals are meant to approach the isolation guarantees of separate machines; yet even remote network services can fall victim to leaks employing cache tag state for communication. Consider the example of the attacker and victim residing in different protection domains,
secret is passed indirectly, syscall completion
via timing of cache accesses, time channel, not closed TABLE I
within a single protection domain by DAWG caches S IMULATED SYSTEM SPECIFICATIONS .
victim’s protection domain attacker’s protection domain
(kernel) (malicious unprivileged app)
Cores DRAM Bandwidth
secret syscall syscall receiver secret
1 2 Count Frequency Controllers Peak
affects affected by 8 OoO 3 GHz 4 x DDR3-1333 42 GB/s
cache state attacker orchestrates syscalls to infer Private Caches Shared Cache
secret via syscall completion time
L1 L2 Organization L3 Organization
syscalls interact via the cache; latency of 2nd syscall depends on accesses made by 1st 2× 32 KB 256 KB 8-way PLRU 8× 2 MB 16-way NRU

Fig. 7. Generalized Attack Schema: an adversary 1) accesses a victim’s secret,


2) reflects it to a transmitter 3) transmits it via a covert channel, 4) receives it VI. E VALUATION
in their own protection domain.
To evaluate DAWG, we use the zsim [48] execution-
driven x86-64 simulator and Haswell hardware [15] for our
sharing no data, but communicating via some API, such as experiments.
system calls. As in a remote network timing leak [10], where
network latency is used to communicate some hidden state in A. Configuration of insecure baseline
the victim, the completion time of API calls can communicate Table I summarizes the characteristics of the simulated
insights about the cache state [31] within a protection domain. environment. The out-of-order model implemented by zsim
Leakage via reflection through the cache is thus possible: the is calibrated against Intel Westmere, informing our choice of
receiver invokes an API call that accesses private information, cache and network-on-chip latencies. The DRAM configuration
which affects the state of its private cache ways. The receiver is typical for contemporary servers at ∼5 GB/s theoretical
then exfiltrates this information via the latency of another DRAM bandwidth per core. Our baseline uses the Tree-
API call. Fig. 7 shows a cache timing leak which relies PLRU (Section III-J1) replacement policy for private caches,
on cache reflection entirely within the victim’s protection and a 2-bit NRU for the shared LLC. The simulated model
domain. The syscall completion time channel is used for implements inclusive caches, although DAWG domains with
exfiltration, meaning no private information crosses DAWG’s reduced associativity would benefit from relaxed inclusion [25].
domain boundaries in the caches, rendering DAWG, and cache We simulate CAT partitioning at all levels of the cache, while
partitioning in general, ineffective at closing a leak of this type. modern hardware only offers this at the LLC. We do this by
The transmitter is instructed via an API call to access a[b[i]], restricting the replacement mask policy_fillmap, while
where i is provided by the receiver (via syscall1), while white-listing all ways via the policy_hitmap.
a, b reside in the victim’s protection domain. The cache tag B. DAWG Policy Scenarios
state of the transmitter now reflects b[i], affecting the latency
of subsequent syscalls in a way dependent on the secret b[i].
The receiver now exfiltrates information about b[i] by selecting
a j from the space of possible values of b[i] and measuring
the completion time of syscall2, which accesses a[j]. The
syscall completion time communicates whether the transmitter
hits on a[j], which implies a[j] ≅ a[b[i]], and, for a compact a,
that b[i] = j – a leak. This leak can be amplified by initializing
cache state via a bulk memory operation, and, for a machine-local
receiver, by malicious mis-speculation.

While not the focus of this paper, for completeness, we
outline a few countermeasures for this type of leak. Observe
that the completion time of a public API is used here to
exfiltrate private information. The execution time of a syscall
can be padded to guarantee constant (and worst-case) latency,
no matter the input or internal state. This can be relaxed to
bound the leak to a known number of bits per access [17].

A zero-leak countermeasure requires destroying the transmitting
domain's cache state across syscalls/API invocations,
preventing reflection via the cache. DAWG can make this
less costly: in addition to dynamic resizing, setting the
replacement mask policy_fillmap to a subset of
policy_hitmap allows locking cache ways to preserve
the hot working set. This ensures that all unique cache lines
accessed during one request have constant observable time.

We evaluate several protection domain configurations for
different resource sharing and isolation scenarios.

1) VM or container isolation on dedicated cores: Isolating
peer protection domains from one another requires equitable
LLC partitioning, e.g., 50% of ways allocated to each of two
active domains. In the case of cores dedicated to each workload
(no context switches), each scheduled domain is assigned the
entirety of its L1 and L2.

2) VM or container isolation on time-shared cores: To allow
the OS to overcommit cores across protection domains (thus
requiring frequent context switches between domains), we also
evaluate a partitioned L2 cache.

3) OS isolation: Only two DAWG domains are needed to
isolate an OS from applications. For processes with few OS
interventions in the steady state, e.g., SPECCPU workloads,
the OS can reserve a single way in the LLC, and flush L1 and
L2 ways to service the rare system calls. Processes utilizing
more OS services would benefit from more ways allocated to
the OS's domain.

C. DAWG versus insecure baseline

Way partitioning mechanisms reduce cache capacity and
associativity, which increases conflict misses, but improves
fairness and reduces contention. We refer to CAT [21] for
analysis of the performance impact of way partitioning on a
subset of SPEC CPU2006.
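The way-locking countermeasure described above (fills restricted to a subset of the hit-allowed ways) can be sketched in a few lines. Below is a minimal Python model of a single cache set, assuming the mask semantics described in the text: hits are permitted in policy_hitmap ways, fills only in policy_fillmap ways. The class, set size, and LRU policy are illustrative, not the hardware implementation.

```python
# Toy model of one DAWG-style cache set with per-domain way masks.
# The names HITMAP/FILLMAP mirror policy_hitmap/policy_fillmap from
# the text; everything else is an illustrative sketch.
class DawgSet:
    def __init__(self, n_ways):
        self.tags = [None] * n_ways        # cached tag per way
        self.lru = list(range(n_ways))     # front = least recently used

    def access(self, tag, hitmap, fillmap):
        """Return True on hit; on a miss, fill only into fillmap ways."""
        for way, t in enumerate(self.tags):
            if t == tag and (hitmap >> way) & 1:
                self.lru.remove(way)
                self.lru.append(way)       # mark most recently used
                return True
        # Miss: pick the LRU victim among the fill-allowed ways only.
        victim = next(w for w in self.lru if (fillmap >> w) & 1)
        self.tags[victim] = tag
        self.lru.remove(victim)
        self.lru.append(victim)
        return False

s = DawgSet(4)
HITMAP = 0b1111    # the domain may hit in all four ways
FILLMAP = 0b1110   # ...but fills avoid way 0: way 0 is locked

s.access("hot", HITMAP, 0b0001)            # pin the hot line into way 0
for i in range(100):                       # thrash with 100 distinct lines
    s.access(("cold", i), HITMAP, FILLMAP)
assert s.access("hot", HITMAP, FILLMAP)    # still a hit: way 0 preserved
```

With the fill mask a strict subset of the hit mask, the excluded way can still be hit but is never chosen as a fill victim, so its contents survive arbitrary thrashing by later fills.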
Here, we evaluate CAT and DAWG on parallel applications
from PARSEC [6], and parallel graph applications from the
GAP Benchmark Suite (GAPBS) [4], which allows a sweep of
workload sizes.

Fig. 8. Way partitioning performance at low associativity in all caches
(8-way L1, 8-way L2, and 16-way L3).

Fig. 9. Way partitioning performance with varying working set on graph
applications. Simulated 16-way L3.

Fig. 8 shows DAWG partitioning of private L1 and
L2 (Section VI-B2) caches in addition to the L3. We
explore DAWG configurations on a subset of PARSEC
benchmarks on simlarge workloads. The cache-insensitive
blackscholes (or the omitted swaptions, with 0.001 L2
MPKI (Misses Per 1000 Instructions)) are unaffected at any
way allocation. For a VM isolation policy (Section VI-B1)
with 8/16 of the L3, even workloads with higher MPKI such as
facesim show at most 2% slowdown. The ⟨2/8 L2, 2/16 L3⟩
configuration is affected by both capacity and associativity
reductions, yet most benchmarks have 4–7% slowdown, up to
12% for x264. Such an extreme configuration can accommodate
4 very frequently context switched protection domains.

Fig. 9 shows the performance of protection domains using
different fractions of an L3 cache for 4-thread instances of
graph applications from GAPBS. We use variable-size synthetic
power law graphs [12], [18] that match the structure of real-world
social and web graphs and therefore exhibit cache
locality [5]. The power law structure, however, implies that
there is diminishing return from each additional L3 way. As
shown, at half cache capacity (8/16 L3, Section VI-B1), there
is at most 15% slowdown (bc and tc benchmarks) at the
largest simulated size (2^20 vertices). A characteristic eye is
formed when the performance curves of different configurations
cross over the working set boundary (e.g., graph size of 2^17).
Performance with working sets smaller or larger than the
effective cache capacity is unaffected — at the largest size cc,
pr, and sssp show 1–4% slowdown.

Reserving one way (6% of LLC capacity) for the OS
(Section VI-B3) adds no performance overhead to most workloads.
The only exception would be a workload caught in the eye,
e.g., PageRank at 2^17 has 30% overhead (Fig. 9), while at 2^16
or 2^18 there is a 0% difference.

Fig. 10. Read-only sharing effects of two instances using Shared vs Private
data of varying scale (1-thread instances). Actual Haswell 20-way 30 MB L3.

D. CAT versus DAWG

We analyze and evaluate scenarios based on the degree of
code and data sharing across domains.

1) No Sharing: There is virtually no performance difference
between secure DAWG partitioning and insecure CAT
partitioning in the absence of read-sharing across domains.
DAWG reduces interference in replacement metadata updates
and enforces the intended replacement strategy within a domain,
while CAT may lose block history, effectively exhibiting random
replacement – a minor, workload-dependent perturbation. In
simulations (not shown), we replicate a known observation that
random replacement occasionally performs better than LRU
near cache capacity. We did not observe this effect with NRU
replacement.

2) Read-only Sharing: CAT QoS guarantees a lower bound
on a workload's effective cache capacity, while DAWG isolation
enforces a tight upper bound. DAWG's isolation reduces cache
capacity compared to CAT when cache lines are read-only
shared across mutually untrusting protection domains. CAT
permits hits across partitions where code or read-only data are
unsafely shared. We focus on read-only data in our evaluation,
as benchmarks with low L1i MPKI like GAPBS, PARSEC, or
SPECCPU are poorly suited to study code cache sensitivity.
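The capacity cost of read-only sharing under DAWG, versus cross-partition hits under CAT, can be illustrated with a toy single-set model. This is our own sketch, assuming hit and fill masks behave as described in the text; the Set class, way counts, and domain assignments are illustrative.

```python
# Toy model contrasting DAWG and CAT on a read-only line ("ro")
# touched by two mutually untrusting domains A and B.
class Set:
    def __init__(self, n_ways):
        self.tags = [None] * n_ways
        self.lru = list(range(n_ways))

    def access(self, tag, hitmap, fillmap):
        for way, t in enumerate(self.tags):
            if t == tag and (hitmap >> way) & 1:
                return True                  # hit in a hit-allowed way
        victim = next(w for w in self.lru if (fillmap >> w) & 1)
        self.tags[victim] = tag
        self.lru.remove(victim)
        self.lru.append(victim)
        return False

A, B = 0b0011, 0b1100          # two domains, two ways each

# DAWG: hits and fills are both confined to the domain's own ways.
s = Set(4)
s.access("ro", A, A)                   # domain A fills its own copy
assert not s.access("ro", B, B)        # B may not hit in A's ways...
assert s.tags.count("ro") == 2         # ...so the line is replicated

# CAT: fills are partitioned, but hits are permitted in any way.
s = Set(4)
s.access("ro", 0b1111, A)
assert s.access("ro", 0b1111, B)       # B hits A's copy across partitions
assert s.tags.count("ro") == 1
```

The replicated line under DAWG is the capacity reduction measured below; the cross-partition hit under CAT is exactly the unsafe sharing that DAWG forbids.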
We analyze real applications using one-line modifications
to GAPBS to fork (a single-threaded process) either before or
after creating in-memory graph representations. The former results
in a private graph for each process, while the latter simulates
mmap of a shared graph. The shared graphs access read-only
data across domains in the baseline and CAT, while DAWG
has to replicate data in domain-private ways. Since zsim does
not simulate TLBs, we ensure different virtual addresses are
used to avoid false sharing. We first verified in simulation
that DAWG, with memory shared across protection domains,
behaves identically to CAT and the baseline with private data.

Next, we demonstrate (in Fig. 10) that these benchmarks
show little performance difference on real hardware [15] for
most data sizes; the Shared baseline models Shared CAT, while
the Private baseline models Shared DAWG. The majority of cycles
are spent on random accesses to read-write data, while read-only
data is streamed sequentially. Although read-only data is
much larger than read-write data (e.g., 16 times more edges
than vertices), prefetching and scan- and thrash-resistant
policies [26], [45] further reduce the need for cache-resident
read-only data. Note that even at 2^23 vertices these effects are
immaterial; real-world graphs have billions of people or pages.

E. Domain copy microbenchmark

We simulated a privilege level change at simulated system
calls for user-mode TCP/IP. Since copy_from_user and
copy_to_user permit hits in the producer's ways, there is
no performance difference against the baseline (not shown).

VII. CONCLUSION

DAWG protects against attacks that rely on a cache state-based
channel, which are commonly referred to as cache-timing
attacks, on speculative execution processors, with reasonable
overheads. The same policies can be applied to any set-associative
structure, e.g., TLBs or branch history tables. DAWG
has its limitations, and additional techniques are required to
block exfiltration channels different from the cache channel.

We believe that techniques like DAWG are needed to restore
our confidence in public cloud infrastructure, and hardware and
software co-design will help minimize performance overheads.

A good proxy for the performance overheads of secure
DAWG is Intel's existing, though insecure, CAT hardware.
Traditional QoS uses of CAT, however, differ from the desired
DAWG protection domain configurations. Research on software
resource management strategies can therefore commence
with evaluation of large-scale workloads on CAT. CPU vendors
can similarly analyze the cost-benefits of increasing cache
capacity and associativity to accommodate larger numbers of
active protection domains.

VIII. ACKNOWLEDGMENTS

Funding for this research was partially provided by NSF
grant CNS-1413920; DARPA contracts HR001118C0018,
HR00111830007, and FA87501720126; Delta Electronics;
DARPA & SPAWAR contract N66001-15-C-4066; DoE award
DE-FOA0001059; and Toyota grant LP-C000765-SR.

We are grateful to Carl Waldspurger for his valuable feedback
on the initial design as well as the final presentation of this
paper. We also thank our anonymous reviewers and Julian Shun
for helpful questions and comments.

REFERENCES

[1] ARM, "ARM Cortex-A72 MPCore processor technical reference manual," 2015.
[2] ARM, "ARM Software Speculation Barrier," https://github.com/ARM-software/speculation-barrier, January 2018.
[3] S. Banescu, "Cache timing attacks," 2011, [Online; accessed 26-January-2014].
[4] S. Beamer, K. Asanović, and D. A. Patterson, "The GAP benchmark suite," CoRR, vol. abs/1508.03619, 2015.
[5] S. Beamer, K. Asanović, and D. A. Patterson, "Locality exists in graph processing: Workload characterization on an Ivy Bridge server," in 2015 IEEE International Symposium on Workload Characterization (IISWC 2015), Atlanta, GA, USA, October 2015, pp. 56–65.
[6] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. dissertation, Princeton University, January 2011.
[7] J. Bonneau and I. Mironov, "Cache-collision timing attacks against AES," in Cryptographic Hardware and Embedded Systems (CHES 2006). Springer, 2006, pp. 201–215.
[8] B. B. Brumley and N. Tuveri, "Remote timing attacks are still practical," in Computer Security (ESORICS). Springer, 2011.
[9] D. Brumley and D. Boneh, "Remote timing attacks are practical," Computer Networks, 2005.
[10] D. Brumley and D. Boneh, "Remote timing attacks are practical," Computer Networks, 2005.
[11] C. Carruth, "Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities," http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180101/513630.html, January 2018.
[12] D. Chakrabarti, Y. Zhan, and C. Faloutsos, "R-MAT: A recursive model for graph mining," in Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 2004, pp. 442–446.
[13] J. Corbet, "KAISER: hiding the kernel from user space," https://lwn.net/Articles/738975/, November 2017.
[14] L. Domnitser, A. Jaleel, J. Loew, N. Abu-Ghazaleh, and D. Ponomarev, "Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks," Transactions on Architecture and Code Optimization (TACO), 2012.
[15] Intel, "Intel Xeon Processor E5-2680 v3 (30M Cache, 2.50 GHz)," http://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2_50-GHz.
[16] N. El-Sayed, A. Mukkara, P.-A. Tsai, H. Kasture, X. Ma, and D. Sanchez, "KPart: A hybrid cache partitioning-sharing technique for commodity multicores," in Proceedings of the 24th International Symposium on High Performance Computer Architecture (HPCA-24), February 2018.
[17] C. W. Fletcher, L. Ren, X. Yu, M. V. Dijk, O. Khan, and S. Devadas, "Suppressing the oblivious RAM timing channel while making information leakage and program efficiency trade-offs," in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Feb 2014, pp. 213–224.
[18] Graph500, "Graph 500 benchmark," http://www.graph500.org/specifications.
[19] D. Gruss, J. Lettner, F. Schuster, O. Ohrimenko, I. Haller, and M. Costa, "Strong and efficient cache side-channel protection using hardware transactional memory," in 26th USENIX Security Symposium (USENIX Security 17). Vancouver, BC: USENIX Association, 2017, pp. 217–233.
[20] D. Gruss, C. Maurice, K. Wagner, and S. Mangard, "Flush+Flush: a fast and stealthy cache attack," in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2016, pp. 279–299.
[21] A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Gianos, R. Singhal, and R. Iyer, "Cache QoS: From concept to reality in the Intel Xeon processor E5-2600 v3 product family," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2016, pp. 657–668.
[22] J. Horn, "Reading privileged memory with a side-channel," https://googleprojectzero.blogspot.com/2018/01/, January 2018.
[23] Intel Corp., "Improving real-time performance by utilizing Cache Allocation Technology," April 2015.
[24] G. Irazoqui, M. S. Inci, T. Eisenbarth, and B. Sunar, "Wait a minute! A fast, cross-VM attack on AES," in International Workshop on Recent Advances in Intrusion Detection. Springer, 2014, pp. 299–319.
[25] A. Jaleel, J. Nuzman, A. Moga, S. C. Steely, and J. Emer, "High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches," in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), Feb 2015, pp. 343–353.
[26] A. Jaleel, K. B. Theobald, S. C. Steely Jr., and J. S. Emer, "High performance cache replacement using re-reference interval prediction (RRIP)," in 37th International Symposium on Computer Architecture (ISCA 2010), Saint-Malo, France, June 2010, pp. 60–71.
[27] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G. Y. Wei, and D. Brooks, "Profiling a warehouse-scale computer," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), June 2015, pp. 158–169.
[28] M. Kayaalp, K. N. Khasawneh, H. A. Esfeden, J. Elwell, N. Abu-Ghazaleh, D. Ponomarev, and A. Jaleel, "RIC: Relaxed inclusion caches for mitigating LLC side-channel attacks," in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), June 2017, pp. 1–6.
[29] R. E. Kessler and M. D. Hill, "Page placement algorithms for large real-indexed caches," Transactions on Computer Systems (TOCS), 1992.
[30] V. Kiriansky and C. Waldspurger, "Speculative buffer overflows: Attacks and defenses," ArXiv e-prints, Jul. 2018.
[31] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, "Spectre attacks: Exploiting speculative execution," ArXiv e-prints, Jan. 2018.
[32] P. C. Kocher, "Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems," in Advances in Cryptology (CRYPTO). Springer, 1996.
[33] J. Kong, O. Aciicmez, J.-P. Seifert, and H. Zhou, "Deconstructing new cache designs for thwarting software cache-based side channel attacks," in Workshop on Computer Security Architectures. ACM, 2008.
[34] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan, "Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems," in HPCA. IEEE, 2008.
[35] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg, "Meltdown," ArXiv e-prints, Jan. 2018.
[36] F. Liu, Q. Ge, Y. Yarom, F. Mckeen, C. Rozas, G. Heiser, and R. B. Lee, "CATalyst: Defeating last-level cache side channel attacks in cloud computing," in HPCA, Mar 2016.
[37] F. Liu and R. B. Lee, "Random fill cache architecture," in Microarchitecture (MICRO). IEEE, 2014.
[38] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, "Last-level cache side-channel attacks are practical," in Security and Privacy. IEEE, 2015.
[39] Y. Oren, V. P. Kemerlis, S. Sethumadhavan, and A. D. Keromytis, "The spy in the sandbox – practical cache attacks in JavaScript," arXiv preprint arXiv:1502.07373, 2015.
[40] D. A. Osvik, A. Shamir, and E. Tromer, "Cache attacks and countermeasures: the case of AES," in Topics in Cryptology (CT-RSA 2006). Springer, 2006, pp. 1–20.
[41] G. Ottoni and B. Maher, "Optimizing function placement for large-scale data-center applications," in 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Feb 2017, pp. 233–244.
[42] M. S. Papamarcos and J. H. Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," SIGARCH Comput. Archit. News, vol. 12, no. 3, pp. 348–354, Jan. 1984.
[43] A. Pardoe, "Spectre mitigations in MSVC," https://blogs.msdn.microsoft.com/vcblog/2018/01/15/spectre-mitigations-in-msvc/, January 2018.
[44] B. Pham, J. Veselý, G. H. Loh, and A. Bhattacharjee, "Large pages and lightweight memory management in virtualized environments: Can you have it both ways?" in Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). New York, NY, USA: ACM, 2015, pp. 1–12.
[45] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. S. Emer, "Adaptive insertion policies for high performance caching," in Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA '07). New York, NY, USA: ACM, 2007, pp. 381–391.
[46] R. Grisenthwaite, "Cache Speculation Side-channels," January 2018.
[47] D. Sanchez and C. Kozyrakis, "Vantage: Scalable and efficient fine-grain cache partitioning," in 38th Annual International Symposium on Computer Architecture (ISCA), June 2011, pp. 57–68.
[48] D. Sanchez and C. Kozyrakis, "ZSim: Fast and accurate microarchitectural simulation of thousand-core systems," in Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA). Association for Computing Machinery, 2013, pp. 23–27.
[49] R. Sprabery, K. Evchenko, A. Raj, R. B. Bobba, S. Mohan, and R. H. Campbell, "A novel scheduling framework leveraging hardware cache partitioning for cache-side-channel elimination in clouds," CoRR, vol. abs/1708.09538, 2017.
[50] G. Taylor, P. Davies, and M. Farmwald, "The TLB slice – a low-cost high-speed address translation mechanism," SIGARCH Computer Architecture News, 1990.
[51] L. Torvalds, "Re: Page colouring," http://yarchive.net/comp/linux/cache_coloring.html, 2003.
[52] C. Trippel, D. Lustig, and M. Martonosi, "MeltdownPrime and SpectrePrime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols," arXiv preprint arXiv:1802.03802, 2018.
[53] P. Turner, "Retpoline: a software construct for preventing branch-target-injection," https://support.google.com/faqs/answer/7625886, January 2018.
[54] C. A. Waldspurger, "Memory resource management in VMware ESX server," in Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI '02). Berkeley, CA, USA: USENIX Association, 2002, pp. 181–194.
[55] Y. Wang, A. Ferraiuolo, D. Zhang, A. C. Myers, and G. E. Suh, "SecDCP: Secure dynamic cache partitioning for efficient timing channel protection," in Proceedings of the 53rd Annual Design Automation Conference (DAC '16). New York, NY, USA: ACM, 2016, pp. 74:1–74:6.
[56] Z. Wang and R. B. Lee, "New cache designs for thwarting software cache-based side channel attacks," in International Symposium on Computer Architecture (ISCA), 2007.
[57] Y. Xu, W. Cui, and M. Peinado, "Controlled-channel attacks: Deterministic side channels for untrusted operating systems," in 2015 IEEE Symposium on Security and Privacy, May 2015, pp. 640–656.
[58] M. Yan, B. Gopireddy, T. Shull, and J. Torrellas, "Secure hierarchy-aware cache replacement policy (SHARP): Defending against cache-based side channel attacks," in Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). New York, NY, USA: ACM, 2017, pp. 347–360.
[59] F. Yao, M. Doroslovacki, and G. Venkataramani, "Are coherence protocol states vulnerable to information leakage?" in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2018, pp. 168–179.
[60] Y. Yarom and K. Falkner, "FLUSH+RELOAD: A high resolution, low noise, L3 cache side-channel attack," in USENIX Security Symposium, 2014.
[61] X. Zhang, S. Dwarkadas, and K. Shen, "Towards practical page coloring-based multicore cache management," in Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys '09). New York, NY, USA: ACM, 2009, pp. 89–102.
