Clearing the Shadows: Recovering Lost Performance for
Invisible Speculative Execution through HW/SW Co-Design
Kim-Anh Tran, Uppsala University, Uppsala, Sweden, kim-anh.tran@it.uu.se
Christos Sakalis, Uppsala University, Uppsala, Sweden, christos.sakalis@it.uu.se
Magnus Själander, Norwegian University of Science and Technology (NTNU), Trondheim, Norway, magnus.sjalander@ntnu.no
Alberto Ros, University of Murcia, Murcia, Spain, aros@ditec.um.es
Stefanos Kaxiras, Uppsala University, Uppsala, Sweden, stefanos.kaxiras@it.uu.se
Alexandra Jimborean, Uppsala University, Uppsala, Sweden, alexandra.jimborean@it.uu.se
ABSTRACT
Out-of-order processors heavily rely on speculation to achieve high
performance, allowing instructions to bypass other slower instructions in order to fully utilize the processor’s resources. Speculatively
executed instructions do not affect the correctness of the application, as they never change the architectural state, but they do affect
the micro-architectural behavior of the system. Until recently, these
changes were considered to be safe, but with the discovery of new
security attacks that misuse speculative execution to leak secret
information through observable micro-architectural changes (so-called side-channels), this is no longer the case. To solve this issue,
a wave of software and hardware mitigations have been proposed,
the majority of which delay and/or hide speculative execution until
it is deemed to be safe, trading performance for security. These
newly enforced restrictions change how speculation is applied and
where the performance bottlenecks appear, forcing us to rethink
how we design and optimize both the hardware and the software.
We observe that many of the state-of-the-art hardware solutions
targeting memory systems operate on a common scheme: the visible execution of loads or their dependents is blocked until they
become safe to execute. In this work we propose a generally applicable hardware-software extension that focuses on removing the
causes of load unsafety, which generally stem from control and memory
dependence speculation. As a result, we manage to make more loads
safe to execute at an early stage, which enables us to schedule more
loads at a time to overlap their delays and improve performance.
We apply our techniques on the state-of-the-art Delay-on-Miss
hardware defense and show that we reduce the performance gap
to the unsafe baseline by 53% (on average).
CCS CONCEPTS
· Security and privacy → Hardware attacks and countermeasures; · Software and its engineering → Source code generation.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
PACT ’20, October 3–7, 2020, Virtual Event, GA, USA
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8075-1/20/10. . . $15.00
https://doi.org/10.1145/3410463.3414640
KEYWORDS
speculative execution, side-channel attacks, caches, compiler, instruction reordering, coherence protocol
ACM Reference Format:
Kim-Anh Tran, Christos Sakalis, Magnus Själander, Alberto Ros, Stefanos
Kaxiras, and Alexandra Jimborean. 2020. Clearing the Shadows: Recovering
Lost Performance for Invisible Speculative Execution through HW/SW
Co-Design. In Proceedings of the 2020 International Conference on Parallel
Architectures and Compilation Techniques (PACT ’20), October 3–7, 2020,
Virtual Event, GA, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/
10.1145/3410463.3414640
1 INTRODUCTION
Side-channel attacks have been known to the security and hardware
communities for years, and they have been demonstrated to be
effective against a number of security systems [6, 9, 23]. Among
them, attacks that use the memory system as the side-channel, be
that the caches, the main memory, the memory bus, or even the
coherence mechanisms, have been particularly effective, partly due
to how easy it is to exploit them [23].
However, recently, with the introduction of Meltdown [21] and
Spectre [17], a new class of side-channel attacks has emerged: speculative side-channel attacks. These attacks can still exploit the same
side-channels but they do so under speculative execution. This
makes them especially devastating because (i) software countermeasures can be easily bypassed during speculative execution (e.g.,
Spectre), (ii) hardware countermeasures might also be bypassed
during speculative execution (e.g., Meltdown), and finally because
(iii) the speculative execution might be squashed, leaving no trace
of anything malicious having ever happened. The reason why these
attacks are possible is because while architectural changes, such
as writes to architectural registers or the memory, are kept hidden
during speculative execution, micro-architectural changes are not.
These might include memory reads, which introduce changes in
the cache hierarchy [17], instruction execution, which introduces
 1  uint8_t array[10];
 2  uint8_t probe[256 * 64];
 3
 4  uint8_t victim(size_t index) {
 5      if (index < 10)
 6          return array[index];
 7      else
 8          return 0;
 9  }
10
11  void attack() {
12      // Train the branch predictor
13      for (...) victim(0);
14      // Flush the probe array from the cache
15      for (...) clflush(probe[i * 64]);
16      // Speculatively load secret data
17      secret = victim(10000);
18      // Leak the secret value
19      _ = probe[secret * 64];
20  }
Figure 1: Speculative side-channel attack example code.
resource contention [35], or even changing the frequency of the
core [33].
In this work we focus on attacks exploiting the memory system.
Figure 1 contains a simplified example that shows how such an
attack can be constructed. The exact same principle is used in the
Spectre v1 attack [17]. In this example, the attacker wants to bypass
the check enforced by the victim function (Line 5) in order to
perform an out-of-bounds access on array and access a memory
location containing a secret value (Line 17). The attack is performed
by:
(1) The attacker starts by training the branch predictor to assume that the if statement in victim is always true (Line
13). This can be done by simply calling the victim function
with valid indexes multiple times. Additionally, a probe array, which will be used later, is allocated and flushed from
the caches (Line 15).
(2) The attacker then proceeds to call the victim function with
an invalid index (Line 17). It will take some time before the
if condition can be checked but thanks to branch prediction
and speculative out-of-order execution the execution can
continue speculatively.
(3) Since the branch predictor has been trained to assume that
the if statement is always true, the execution will continue
speculatively by accessing the array with the invalid index.
The attacker then proceeds to use the secret value as an
index in the probe array (Line 19). This will cause one cache
line from the probe array to be loaded into the cache, namely
the one that is indexed by the secret value.
(4) Once the branch misprediction is detected, the speculative
execution is squashed without causing any architectural
changes. The execution restarts at the if statement, this
time returning the value 0, indicating an error.
(5) The attacker can now probe the probe array, by trying each
possible index and measuring the time it takes to access the
array. Since one cache line was loaded during the attack, and
the index of that cache line depends on the secret value read
during the attack, the attacker can determine the secret value
by finding which cache line takes less time to access
(not shown here).
Many state-of-the-art hardware defense mechanisms will try
to either delay or hide the speculative load that leads to the information leakage. In our example, the speculative access to array
[10000] would therefore not return a value that can be used to leak
the secret. While secure, such mechanisms suffer from reduced performance [30, 36, 39]. Being able to execute (speculative) loads in
parallel and out-of-order is crucial for performance. It allows memory latencies to be overlapped, which makes better use of existing
resources and achieves a high degree of memory-level-parallelism
(MLP). With MLP, the performance cost of memory operations can
be significantly reduced. With the delay of loads, memory accesses
have to be serialized and the gap between memory and processor
speed widens even more. In a way, disallowing the processor from
speculatively executing loads is a restriction on the out-of-orderness
of the out-of-order processor.
In this paper we look into the possibility of finding MLP despite the
serializing effects of delaying speculative loads for security. Our
goal is to further close the performance gap between the unsafe
baseline and the secure out-of-order processor. To this end, we introduce a software-hardware extension that is generally applicable
to several existing hardware defense solutions. Our observation is
that delay-based mitigation techniques only allow the side-effects
of speculative loads to become observable once the loads are deemed
safe. What they all miss is that we can actually influence when
loads become safe if we manage to remove the cause for
speculation at an early stage.
Our techniques remove the reason for speculation when possible, and otherwise shorten the period of time in which loads are
considered to be speculative. As a result, more loads become safe to
execute and we unlock and exploit the potential for MLP, and thus
for performance. Our contributions are:
(1) The proposal of a generally applicable software-hardware
extension to improve the performance of hardware solutions
that target speculative attacks on the memory system by
delaying or hiding loads and their dependents, including:
(a) The usage of a coherence protocol that allows loads to be
safely, non-speculatively, reordered under TSO [27], thus
unlocking the potential for MLP.
(b) An instruction reordering technique that exposes more
MLP through (i) prioritizing instructions that contribute to
unresolved memory operations and unresolved branches,
and through (ii) scheduling independent instructions in
groups.
(2) The evaluation of our extension on top of the state-of-the-art Delay-on-Miss security mechanism [30], which delays
speculative loads that miss in the L1 cache.
Although we select a specific hardware defense to evaluate our
ideas, our solutions are not tied to a specific system. They are applicable to any hardware solution that tackles observable memory-hierarchy side-effects by restricting the execution of loads and their
dependents, since this is what we focus on.
Our evaluation shows that our techniques improve over Delay-on-Miss by 9%, and thus reduce the performance gap to the unsafe baseline processor by 53% on average.

2 SPECULATIVE SHADOWS AND DELAY-ON-MISS
Completely disabling speculative execution would solve all speculative side-channel attacks, but it would come at an unacceptable
performance cost. Instead, the selective delay solution proposed by
Sakalis et al. [30] reduces the observable micro-architectural statechanges in the memory hierarchy while trying to delay speculative
instructions only when it is necessary. Specifically, only loads are
delayed, as other instructions (such as stores) are not allowed to
cause any changes in the memory hierarchy while speculative. In
addition, only loads that miss in their private L1 cache are delayed,
as loads that hit in the L1 cause minimal side-effects that can be
easily hidden until the load is no longer speculative. Sakalis et al.
name their technique Delay-on-Miss.
The authors introduce the concept of speculative shadows [29,
30] to understand when a load is considered to be speculative. Traditionally, any instruction that has not reached the head of the reorder
buffer might be considered speculative, but speculative shadows
offer a more fine-grained approach. Speculative shadows are cast
by instructions that may cause misspeculation, such as branches.
Branches need to be predicted early in the pipeline, as instructions
need to be fetched based on the branch target. If the branch is
mispredicted, then the wrong instructions might be executed, as
seen in the example in Section 1. However, there is no need to wait
until the branch reaches the head of the reorder buffer to mark it
as non-speculative; instead, this can be done as soon as the branch
target has been verified. Therefore, the branch will cast a shadow
that extends from the moment the branch enters the reorder buffer
until the branch target is known.
The authors categorize the shadows into four types, depending
on the reason for misspeculation: the E- (exception), C- (control),
D- (data), and M- (memory) shadows. If value prediction is used,
a fifth type, the VP- (value prediction) shadow, is also introduced,
but we do not explore the use of value prediction in this work.
Table 1 shows an example for each shadow type. E-shadows relate to
instructions that may throw an exception, C-shadows are caused by
unresolved branches, D-shadows by potential data dependencies,
and, finally, M-shadows exist under memory models where the
observable memory order of loads has to be conserved, such as the
Total Store Order (TSO) model. Shadows are lifted as soon as the
reason for the potential misspeculation is resolved; for example, for
memory operations, the E-shadow is lifted as soon as the permission
checks can be performed. If a load is under any of these shadow
types then it is not allowed to be executed, unless it hits in the L1
cache.
Figure 2 shows the performance degradation of delaying unsafe
loads as described by Sakalis et al. on a range of SPEC 2006 benchmarks. Each benchmark is represented by a number of hot regions
that were identified through profiling (for more information on the
selection of regions for evaluation see Section 4). On average the
delay of loads incurs a 23% performance degradation compared to
the unsafe, unmodified out-of-order core, measured in the number
of cycles required to execute the regions.

Table 1: Examples for shadow types identified by Sakalis et al. [30]. In each example, the load instruction in y = ... is under a shadow cast by the previous instruction.

E-shadow (Exception):
    int x = a[invalid] /* throws */
    int y = a[i]
    /* E-shadows are cast by any instruction that may throw an exception */

C-shadow (Control):
    if (test(i)) { /* unknown path */
        int y = a[i]
    }
    /* C-shadows are cast by unresolved branches. */

D-shadow (Data):
    a[i] = compute()
    int y = b[i] /* a == b? */
    /* D-shadows are cast by potential data dependencies. */

M-shadow (Memory):
    int x = a[i] /* load order in TSO */
    int y = a[i+1]
    /* M-shadows conserve the observable load ordering under TSO. */

Figure 2: Impact of shadows on performance: the number of cycles required for Delay-on-Miss running the selected regions, normalized to the unsafe out-of-order processor. [Bar chart omitted; y-axis: Normalized Number of Cycles, one bar per SPEC 2006 region plus the geometric mean.]
Figure 3 shows the contribution of loads, stores, control and other
instructions to the overall number of shadows that are cast for the
selected benchmarks. The largest proportion of shadows is cast by
load and control instructions; only a small proportion is cast by
stores, and a minimal amount by the remaining instructions,
such as floating-point operations. In the following we discuss
how to shorten the shadow duration of those instructions that
contribute most to the overall number of cast shadows, namely
load, store, and control instructions.

Figure 3: Causes of speculation. [Stacked-bar chart omitted; y-axis: Normalized Causes of Speculation, with per-region bars broken down into Load, Store, Control, and Other.]

for (int i = 0; i < 1000; ++i) {   // C
    int addr1 = ..;
    int l1 = p1->a[addr1];         // E,M
    params.fullf[0] = l1;          // E,D
    int addr2 = ..;
    int l2 = p2->a[addr2];         // E,M
    params.fullf[1] = l2;          // E,D
    bool cond = l1 < l2;
    if (cond) ..                   // C
}

Figure 4: Example code showing the type of shadows cast by
the instructions (E, C, M, D), and their overlap. Instructions
towards the end of the code excerpt are blocked by several
overlapping shadows and are thus darker.
3 EARLY SHADOW RESOLUTION AND ELIMINATION
When the shadow that covers a load is resolved/removed, we
refer to that load as unshadowed, and to the act as unshadowing a load.
For most loads, removing a single shadow is not enough, because
they are covered by multiple overlapping shadows and for the load
to become unshadowed, all shadows cast by preceding instructions
need to be removed. Consider Figure 4, which shows a code example
with overlapping shadows. To the right of the code we annotate
the types of shadow each line casts. As an example, the first
line (for (int i = 0; i < 1000; ++i)) contains a comparison
(𝑖 < 1000) which is used to branch to the loop body. Unresolved
branches cast C-shadows, and therefore a shadow (illustrated with
a gray box) spans over the succeeding code. As almost all lines cast
shadows, an increasing number of shadows end up overlapping.
Figure 5: Average number of shadows that are blocking a
load at a time.
This example illustrates why simply removing one single shadow
does not make any difference: to successfully unshadow loads we
need to remove all overlapping shadows cast by the instructions
that lead up to each load.
For SPEC 2006, an average of 63% of the total number of dynamic instructions are either loads, stores, or branches [25]. This
means that at least 63% of the instructions in SPEC 2006 have the
potential to cast shadows¹. In Figure 5 we can see, on every cycle,
the average number of shadows that each load is simultaneously
under. The results show that there are on average five separate
overlapping shadows shadowing each load. Across all benchmarks,
the maximum number of distinct shadows that shadow a load at a
time is 59.
MLP: The Key to Performance. An important aspect of speculative
execution is allowing multiple loads to execute in parallel, which
enables faster loads (cache hits) to bypass long latency loads (cache
misses) and also multiple long latency loads to overlap with one
another. This results in memory-level-parallelism, which benefits
performance significantly. Shadows prevent multiple loads from
executing ahead of time, since sensitive information may be leaked
if a load is executed when it should not have been. Shadows thus
handicap the out-of-order processor’s capability to speculatively
execute instructions (loads) in an out-of-order fashion. The execution of loads is serialized, which affects the performance of both
memory- and compute-bound applications.
To successfully narrow the performance gap between the unsafe
and the secure out-of-order processor, we need to find ways to
increase MLP while maintaining the same security guarantees. But
how can we achieve MLP? Loads, stores, and branches are usually
interleaved in the code, and so are their shadows. For us to successfully unshadow loads, overlap their latencies and thus increase
MLP, we need to find solutions that consider all shadow types.
In the following sections we detail how this can be done. Table 2
gives an overview of the shadow-casting instructions and their
shadow types, as well as the techniques that we apply to remove
them. Where necessary, i.e., on strongly consistent systems, we propose
changing the coherence protocol to completely remove M-shadows.
We also propose applying compiler techniques to shorten, or in
some cases completely eliminate, the duration of E- and C-shadows.
D-shadows overlap with their respective E-shadows (both resolve
as soon as the address is known) and will therefore not be explicitly
mentioned in the following sections.

¹Other instructions, like floating-point operations, may also cast shadows due to exceptions. However, these exceptions can often be disabled through software.

Table 2: Overview of shadow-casting instructions, the shadows they cast (✗), and the solutions in this work to address them. The percentage (%) shows the average share of shadows for which each instruction type is responsible (for SPEC 2006 [25]). We exclude shadow-casting instructions that are not memory or control instructions, as their total share is negligible (see Section 2). By excluding them, D- and E-shadows can be combined into one category.

Shadow   | Load (70%) | Store (1.9%) | Branch (28%) | End of Shadow when..        | Unshadowing Technique
E-shadow | ✗          | ✗            |              | Target address known        | Early Target Address Computation (Section 3.2)
C-shadow |            |              | ✗            | Branch target address known | Early Condition Evaluation (Section 3.2)
M-shadow | ✗          |              |              | Load has executed           | Non-speculative Load-Load Reordering (Section 3.1)

3.1 Non-Speculative Reordering of Loads (M-shadows)
Among all shadows the M-shadows are the most restrictive on
MLP. They are cast by every single load, and even if all other shadows could be magically lifted, the M-shadows would still enforce
program order for all loads. Without the security concerns, an out-of-order processor may speculatively bypass an older load if the
younger load is delayed (e.g., if its operands are not yet available).
A reordering is observed if two reordered loads obtain values that
contradict the memory order in which these values have been written. Consider two loads ld x, ld y that are executed on one core
and two stores st y, st x that are executed on another core. Let
𝑥 1 be the old value and 𝑥 2 the updated value of 𝑥 after the store
(similarly for 𝑦1 and 𝑦2 ). An illegal reordering under TSO would be
one in which the first load ld x loads 𝑥 2 (the updated value), but
the second load ld y loads 𝑦1 (the old value). This reordering can
happen if ld y bypasses ld x.
Since the M-shadows disallow reordering, loads are serialized,
which restricts our ability to improve MLP. To solve this, we propose
applying a method for non-speculative load-load reordering [27]
that allows reordering of loads while effectively hiding it through
the coherence protocol. In other words, the execution of younger
loads before older loads is allowed (given that they are independent and both are valid accesses to memory), but not revealed
to other cores. Consider the following scenario on the previous
example: a core bypasses ld x (e.g., because loading 𝑥 misses) and
executes ld y ahead of time. Another core now performs the store
operations to the same memory locations st y, st x. Since the
loads have been reordered, this would normally lead to an invalidation and therefore the squashing of the speculated load ld y.
Instead of squashing, the coherence protocol delays acknowledging the invalidation, such that both loads can finish execution and
the reordering cannot be detected any longer, thus eliminating the
possibility of a misspeculation.
Note that M-shadows are an artifact of systems that require
the aforementioned load-load order to be enforced, such as x86
systems utilizing the TSO memory model. On systems where this
is not the case, such as the numerous ARM systems utilizing
a Release Consistency (RC) memory model, M-shadows do
not exist and there is no reason to implement a non-speculative
load-load reordering solution, such as the one described above, to
eliminate them.

3.2 Early Evaluation of Conditions and Addresses (C- and E-Shadows)
Both C- and E-shadows are lifted as soon as the physical target
addresses of memory operations and branch targets are known
(for branches, that means an early evaluation of their condition).
To shorten their shadow duration, we need to compute the target
addresses as early as possible. Unlike on a traditional out-of-order
processor, where we want to keep the address computation close
to the instruction to reduce register pressure, on the secure out-of-order processor we want to hoist and overlap the computations
feeding loads and branches as much as is necessary for all addresses to be ready, so that we can execute them in parallel and
ultimately gain MLP.
To this end we reorder the instructions to prioritize target address
computation of memory operations and condition evaluation of
branch conditions. To keep the problem tractable, we focus on local
reordering within basic blocks, as hoisting and lowering beyond
basic block boundaries is problematic for three reasons. First, while the
secure out-of-order processor cannot rely on branch prediction to
execute loads past unresolved branches, it can still execute non-load
instructions past branches (which is safe, as they do not change the cache,
and therefore squashing them does not leave traces). As branch
prediction is very accurate, these safe-to-execute instructions
will be executed whether or not they are hoisted across the branch.
Second, lowering memory operations and their uses to successors
would risk delaying execution more than necessary. Ideally we
would like to delay loads only as much as needed for the address
computation to be ready. Finally, on the compiler side, remaining
within the same basic block simplifies the analyses and reduces the
overhead introduced by hoisting these instructions.
The idea is to overlap address computation and branch condition
evaluation, such that they are ready as soon as possible to allow
the hardware to remove the C- and E-Shadows as early as possible. The algorithm consists of two parts, the generation of buckets
(Algorithm 2) from the original code, followed by the reordering of
instructions (Algorithm 1).
The idea behind the bucket generation is to find a representation
that groups the independent instructions and orders the dependent
Figure 6: The original code and the selected instructions to hoist (address computation for memory operations and branch
target, marked with ✗) are shown in Figure (a). The selected instructions and their dependencies are ordered into buckets as in
Figure (b). Instructions within a bucket are independent of each other. An instruction of a bucket has at least one dependency
on its preceding bucket. The buckets determine the order in which they will be hoisted to the beginning of the basic block.
The remaining instructions are kept in their original order. Figure (c) shows the resulting reordering.
instructions, with the goal of finding a legal reordering that overlaps
independent instructions and orders the dependent instructions
while maintaining the correct dependency. An instruction 𝑖 in a
bucket 𝑏𝑗 is dependent on one or more instructions in 𝑏𝑗−1. Other
dependencies may reside in previous buckets (𝑏𝑗−2, .., 𝑏0) too, but
there is at least one dependency chain from 𝑏0 to 𝑏𝑗−1 that forces 𝑖
to reside in 𝑏 𝑗 . All instructions within one bucket are independent
of each other. In the second step, the actual reordering, we select
those instructions that contribute to the address computation and
branch condition and hoist them according to the ordering specified
by the bucket ordering.
Figure 6 shows an example of the bucket creation. The code in
Figure 6 (a) is the first basic block of the code in our previous example in Figure 4. To only have one operation per line (which is closer
to the code the compiler sees in its intermediate representation), we
split some lines into two, and use the goto keyword to represent
the branch instruction at the end of the basic block. The instructions to hoist (i.e., if they contribute to any memory target address
and branch condition computation) are marked with ✗. Figure 6 (b)
shows the buckets created for the code, and Figure 6 (c) has the final
reordered code. If several instructions within the same bucket are
to be hoisted, we reorder the non-memory operations to precede
the memory operations of that bucket, so that the address is
ready by the time the memory operation is issued (not shown in
Figure 6).
Algorithm 1 shows our algorithm to reorder instructions. We go
through the basic block and collect all instructions that are of interest for hoisting (Line 2). We then find all the instructions that need
to be hoisted along with them, since they are dependencies that are
required for correct execution (Line 3). These dependencies are data
dependencies, aliasing (may- or must-aliasing) memory operations,
or instructions with side-effects that may change memory. Afterwards we apply the bucket creation on the collected instructions
(Line 4, detailed below) and hoist them according to their order
within the buckets (Line 5). The result is the reordered basic block.
We apply a top down approach for creating the buckets, see
Algorithm 2. Starting with the first instruction in the basic block,
Input: BasicBlock 𝐵𝐵
Output: Reordered BasicBlock
1  begin
2      instsToHoist ← FindInstsToHoist(BB)
3      targetInsts ← FindDepsRecursive(instsToHoist)
4      buckets ← SortInstIntoDepsBuckets(BB, targetInsts)
5      BB𝑟𝑒𝑜𝑟𝑑𝑒𝑟𝑒𝑑 ← HoistInsts(buckets, BB)
6      return BB𝑟𝑒𝑜𝑟𝑑𝑒𝑟𝑒𝑑
7  end
Algorithm 1: Algorithm to identify and reorder the instructions of interest

Input: BasicBlock 𝐵𝐵, InstsToHoist 𝐻𝑜𝑖𝑠𝑡
Output: Buckets
1  begin
2      𝑏 ← 0
3      instToBucket ← {}
4      foreach 𝑖𝑛𝑠𝑡 in BB do
5          if 𝑖𝑛𝑠𝑡 ∉ Hoist then
6              continue
7          deps ← GetDeps(𝑖𝑛𝑠𝑡)
8          depBucketNumber ← GetHighest(deps, instToBucket)
9          𝑏 ← depBucketNumber + 1
10         instToBucket[𝑖𝑛𝑠𝑡] ← 𝑏
11         buckets[𝑏] ← 𝑖𝑛𝑠𝑡
12     end
13     return buckets
14 end
Algorithm 2: A top-down approach to create the buckets containing the instructions to hoist and all their dependencies
we first check if it is selected for hoisting (Line 5). If it is, we collect
its dependencies, namely its operands, any preceding aliasing stores
if we encounter a load, and any preceding aliasing loads and stores
if encountering a store (Line 7). For each dependency we look up
which bucket it belongs to and record the highest found bucket
number (Line 8). If a dependency does not belong to the basic block
in focus we do not consider it. Since we go through the basic block
from top to bottom, all dependencies have already been taken
care of in previous iterations, and their bucket number can be looked
up using a map (Line 8, Line 10). The bucket number of the current
instruction is the highest number of all its dependencies plus one
(Line 9). Finally, we add the current instruction to its corresponding
bucket (Line 11).
In our example we hoisted instructions that compute addresses
for memory operations or conditions for branch instructions. While
this is the most intuitive solution for the removal of E- and C-shadows, we also evaluate a version that chooses all instructions
within the basic block for reordering. The intuition behind this
is the following: allowing independent instructions to be issued
in-between increases the chance that the required addresses and
the branch condition are ready to be consumed as soon as they are
needed. In addition, by grouping and reordering all instructions,
we also schedule independent loads together, which may further
increase MLP. In Section 4 we evaluate both versions and will
see that choosing all instructions indeed turns out to be better for
performance in many cases.
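The final reordering step (HoistInsts in Algorithm 1) can then be sketched as follows — a simplified stand-in for the actual LLVM transformation, assuming the buckets have already been computed:

```python
def hoist_insts(buckets, basic_block, hoisted):
    """Reorder a basic block, sketched: emit bucket 0, bucket 1, ...
    first (every dependency sits in a lower bucket than its users),
    then append the remaining instructions in their original order."""
    reordered = [inst for b in sorted(buckets) for inst in buckets[b]]
    reordered += [inst for inst in basic_block if inst not in hoisted]
    return reordered
```

In the "Addr" variant only the address and branch-condition computations (and their dependencies) are in `hoisted`; in the "All" variant every instruction of the block is.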
3.3 Discussion on Security Guarantees of Our Approach
Our approach makes use of three main components: (i) Delay-on-Miss,
(ii) non-speculative load reordering, and (iii) early shadow resolution
through instruction reordering. In this section we discuss
how Delay-on-Miss is effective against speculative side-channel
attacks and how our proposal maintains the security guarantees of
Delay-on-Miss.
3.3.1 Delay-on-Miss. Speculative loads can have visible side-effects
on the memory hierarchy, which can be exploited by attacks such
as Spectre to reveal secrets. Delay-on-Miss prevents speculative
side-channel attacks by delaying such speculative loads. Under
Delay-on-Miss, instructions that may cause a misspeculation are
said to cast a shadow on all instructions that follow them. When
such a shadow is cast by an instruction, it can be lifted only when it
is known that no misspeculation can originate from said instruction.
Loads that are under such shadows are categorized as speculative
and unsafe and, if they request data and the request misses in the
cache, are not allowed to proceed (i.e., they are delayed) until it is
deemed safe to do so (i.e., until they are unshadowed). If, however,
the request leads to a cache hit, the data is served, and instead only
actions that may cause side-effects (such as updating the replacement state) are delayed. These restrictions ensure that there are no
visible side-effects in the memory hierarchy that can be exploited
by speculative side-channel attacks.
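The per-load policy described above can be condensed into a small decision sketch; the state encoding is ours for illustration, not the hardware implementation:

```python
def handle_load(shadowed, cache_hit):
    """Delay-on-Miss policy for a single load, sketched."""
    if not shadowed:
        # No shadow is cast on the load: it is safe and executes normally.
        return "execute"
    if cache_hit:
        # Data may be served, but side-effect-causing actions
        # (e.g. replacement-state updates) are deferred.
        return "serve-data-defer-side-effects"
    # A miss would visibly change the memory hierarchy: delay the
    # request until all shadows on the load are lifted.
    return "delay-until-unshadowed"
```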
Now that we have established that Delay-on-Miss protects against
Spectre and other similar attacks, we show that the components
added on top of Delay-on-Miss do not open up new security vulnerabilities.
To begin with, our instruction scheduling technique is conservative and does not reorder instructions speculatively. The scheduling
technique selects the set of instructions that contribute to either
the address computation of memory operations or the computation
of the branch target, and hoists them to the beginning of a basic
block (see Section 3.2 for more details). In order to make sure that
hoisting does not access data speculatively (which would open up a
security hole), when we encounter memory operations we also hoist all
preceding may- and must-aliasing operations, as well as other operations
that may have side-effects (such as function calls).
Figure 7 shows a reordering example, where the set of instructions to
hoist includes a memory operation. Figure 7(a) shows the original code
and the instructions that we initially select for hoisting. Note that one
of the selected instructions performs a load from memory (p1→a[addr1]),
which follows a store to memory (p1→a[addr1] = x). Figure 7(b) shows the
bucket creation for the instructions to hoist to the beginning of the
basic block. In this case, the two memory operations may or must alias;
and since we want to be conservative, we include the store operation
when hoisting and respect the potential dependency (the load operation
has to follow the store operation). Figure 7(c) depicts the case where
at compile time we know that these two operations are independent of
each other. In that case, the load operation may be scheduled earlier
than the store operation in focus (as it does not access stale data),
and the store operation is therefore not included in the bucket creation.
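The conservative dependency collection used for Figure 7 can be sketched as follows; `may_alias` is a placeholder for the compiler's alias analysis, and instructions are modeled as small dictionaries purely for illustration:

```python
def get_deps(inst, basic_block, may_alias):
    """Collect dependencies for hoisting, sketched: for a load, any
    preceding possibly-aliasing store; for a store, any preceding
    possibly-aliasing load or store (operands omitted here)."""
    deps = list(inst.get("operands", []))
    idx = basic_block.index(inst)
    for prev in basic_block[:idx]:
        if inst["kind"] == "load" and prev["kind"] == "store" \
                and may_alias(prev, inst):
            deps.append(prev)
        elif inst["kind"] == "store" and prev["kind"] in ("load", "store") \
                and may_alias(prev, inst):
            deps.append(prev)
    return deps
```

With a conservative `may_alias` that answers True, the store of Figure 7(b) is pulled into the bucket creation together with the load; if alias analysis can prove independence (Figure 7(c)), the store is left out.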
The last component of our approach is the non-speculative load-load
reordering, which does contain mechanisms that can cause
observable timing differences in the system. Specifically, it makes
use of lockdowns when a younger load is performed to delay acknowledging
incoming invalidations. When a cache line is in lockdown, writers to that
cache line are delayed, and this delay can potentially be observed by the
writers and used as a side-channel. In our case, we do not introduce a
new speculative side-channel, because of the following:
While under a shadow other than an M-shadow, the rules of
Delay-on-Miss apply and no speculative loads are allowed to make
any visible changes to the memory hierarchy. This includes loads
that would need to get non-cacheable data or go into lockdown. As
the loads are covered by other overlapping shadows, removing the
M-shadow at this stage would not help in regaining MLP anyway.
Instead, a load is allowed to go into lockdown only after all other
shadows have been resolved and the load is shadowed by nothing
other than an M-shadow. At this stage, the M-shadow can be safely
removed, as the load reordering is now non-speculative and it will
not be squashed by an invalidation.
In essence, Delay-on-Miss itself prevents the possible speculative
side-channel that would have been introduced by the non-speculative
load-load mechanism. At the same time, non-speculative load-load
reordering can also be used as a non-speculative side-channel when the
attacker and the victim share physical memory. Under such conditions,
simpler, pre-existing, related side-channel attacks, such as
Invalidate+Transfer [14], can be exploited. Solutions for such
non-speculative attacks already exist and can be applied to the lockdown
side-channel, but they fall outside the scope of this work.
Figure 7: Non-speculative reordering of instructions. Figure (a) shows the
original code. Initially all the instructions that contribute to address
computation for either memory operations or branch target computation are
chosen for hoisting (marked with ✗). Figure (b) shows what buckets are
created if the write to p1→a[addr1] and the load from p2→a[addr2] may (or
must) alias, i.e., loading the data before writing may lead to retrieving
stale data (and would thus leak secrets). Note that apart from the selected
instructions for hoisting, the store operation is also included in the
bucket creation, as we need to make sure that we do not load stale data if
p1→a[addr1] and p2→a[addr2] were to alias. However, if we know at compile
time that these memory operations do not alias, the write operation does
not need to be included in the set of instructions to hoist, see Figure (c).
Table 3: Simulation parameters used for Gem5

Parameter              Value
Technology node        22 nm
Processor type         Out-of-order x86 CPU
Processor frequency    3.4 GHz
Address size           64 bits
Issue width            8
Cache line size        64 bytes
L1 private cache size  32 KiB, 2-way
L1 access latency      2 cycles
L2 shared cache size   512 KiB, 8-way
L2 access latency      20 cycles

4 EVALUATION
We implement our ideas on top of the Delay-on-Miss proposal [30].
The next sections describe our experimental set-up (Section 4.1) and
the performance results (Section 4.2).
4.1 Experimental Set-up
The compiler analysis and transformation is implemented on top of
LLVM (version 8.0) [19]. We use Gem5 [4] with the Delay-on-Miss
implementation from Sakalis et al. [30] as our simulator. Table 3
shows the configuration chosen for simulation (i.e., a large out-of-order
processor, the same set-up as for the Delay-on-Miss work). The
baseline is always compiled with the highest optimization level
(-O3). For evaluation we have chosen the SPEC CPU 2006 [12]
benchmark suite. We focus on the C and C++ workloads which we
were able to compile and run out-of-the-box using both LLVM and
Gem5.
Since our evaluation is based on simulation, we need to identify
relevant phases of the benchmark that can be simulated. On top
of this, we also need to make sure that each region is well-defined,
such that different simulation runs using different binaries can be
compared.
We compare the different binaries by focusing on statistics for hot
regions that are identified using profiling. Table 4 lists the selected
regions for each benchmark. For each region we state (i) how many dynamic
instructions it corresponds to in Gem5 (on average), (ii) the percentage
of runtime all executions of that region are attributed to relative to
the whole program run, and (iii) the total percentage when considering
all regions of a benchmark.
The regions do not add up to 100% and there are several reasons
for this: the main loop may be recursive (thus too large to cover
as a whole within one simulation run), or the code may have a lot
of very small regions whose contribution to the overall execution
time is negligible. For (Gem5) practicality reasons we also do not
capture regions that start beyond three billion instructions.
The performance numbers in our work do not match (and cannot
be compared to) the performance numbers that were presented
for Delay-on-Miss [30]: first, Sakalis et al. use a different selection of
benchmarks, and second, we focus our evaluation on hot regions to
be able to compare our versions fairly, so the simulated
regions do not match.
Evaluated Versions: Our baseline is Delay-on-Miss running
on a large, unmodified out-of-order processor. Our extensions are
implemented on top of it. Table 5 shows the evaluated versions and
the respective names that will be used in the following.
4.2 Performance
As we compare different binaries, we use the total number of cycles
as a metric for performance (IPC is not a good fit because the
number of instructions varies for each binary). It reflects the number
of cycles that were required to finish the same amount of work, i.e.,
the regions that we identified in Table 4.
Figure 8 shows the number of cycles normalized to DoM, i.e., the
improvement relative to DoM. In the following we will mainly
Table 4: Benchmarks and the selected regions of interest (ROI). For each
region, we list the average number of dynamic micro-instructions of one
region run, and the percentage of program runtime each region contributes
(if we were to run the whole program from start to end).

Benchmark       Region of Interest          Avg. Number of Instructions   % of Runtime   Total %
401.bzip2       BZ2_blockSort               118,562,460                   54.7%          69.7%
                BZ2_decompress              1,711,253                     15%
429.mcf         primal_net_simplex          87,334                        78.6%          78.6%
433.milc        path_product                133,629,807                   26.4%          74.5%
                u_shift_fermion             24,156,916                    26.9%
                compute_gen_staple          279,922,620                   21.2%
444.namd        doWork                      3,310,837                     64.6%          64.4%
450.soplex      leave                       1,289,630                     31.7%          31.7%
456.hmmer       P7Viterbi                   6,314,862                     95.1%          95.1%
458.sjeng       gen                         1841                          18.5%          50.5%
                std_eval                    3182                          32.0%
462.libquantum  quantum_sigma_x             14,680,142                    18%            78.2%
                quantum_toffoli             23,540,617                    60.2%
464.h264ref     encode_one_macroblock       2,199,007                     98.6%          98.6%
470.lbm         LBM_performStreamCollide    393,922,845                   97%            97%
471.omnetpp     do_one_event                1810                          92.3%          92.3%
473.astar       regwayobj::makebound2       3062                          18.6%          94.9%
                wayobj::fill                75,885,242                    48.1%
                way2obj::fill               607,874,212                   28.2%
Figure 8: Normalized number of cycles for Delay-on-Miss with our
extensions (DoM+M, DoM+EC-All, DoM+EC-Addr, and their combinations
DoM+M+EC-Addr, DoM+M+EC-All), and the unsafe out-of-order (unsafe).
Baseline is Delay-on-Miss as in Sakalis et al. (see red line).
focus on comparing our extensions with DoM; we will, however,
give some insight into the performance differences between our work and
the unsafe out-of-order in Section 4.3.
The Effect of Removing M-shadows on Performance. Figure 8 shows
that by only introducing the coherence protocol on top of Delay-on-Miss
we can significantly improve performance (see DoM+M).
By allowing loads to be reordered, DoM+M improves on
DoM by 7% (on average). M-shadows enforce an ordering on loads
which restricts their parallel execution. Using the coherence
protocol we can completely disable the M-shadows, which allows
the processor to execute loads in parallel (if not still shadowed by
another instruction) and therefore overlap their delays. This allows
for better resource usage and helps to hide the long latencies
that memory accesses introduce. While some benchmarks benefit a
lot from removing the M-shadows (such as milc, 37% (-shift), 23%
(-compute), 20% (-path), and omnetpp, 22%), others are not affected
at all (such as lbm, astar, and h264ref). Several aspects
play a role in deciding whether or not the removal of M-shadows
will have a positive effect on performance.
Table 5: Evaluated versions. All versions are based on Delay-on-Miss
(DoM). For the E- and C-shadows we evaluate two versions: one that
reorders all the instructions (All), and one that only reorders the
memory and branch target address computation (Addr), see Section 3.2
for more details. A cell marked (✗) is enabled for the version of
that row.

Version Name      M-shadows   E- and C-shadows (All)   E- and C-shadows (Addr)
DoM               –           –                        –
DoM+M             ✗           –                        –
DoM+EC-All        –           ✗                        –
DoM+EC-Addr       –           –                        ✗
DoM+M+EC-All      ✗           ✗                        –
DoM+M+EC-Addr     ✗           –                        ✗
Benchmarks that benefit from the removal of M-shadows are
likely to exhibit many cache misses that can be overlapped, making
efficient use of the hardware resources and thus gaining performance.
On top of this, it is beneficial if there is little control flow (few
C-shadows) and little address computation (short E-shadows). All milc
regions fall into this category. Milc is categorized as a memory-bound
benchmark [15]. Looking at the hot regions, milc makes use of matrix
operations that include a number of independent load operations that
access memory using simple, constant indices. Since the address
computation is quick to finish, many of the E-shadows are likely to be
very short. On top of this, milc has little control flow, and thus not
many overlapping C-shadows that would otherwise block loads from
executing. This combination of characteristics makes milc a good fit
for DoM+M.
On the other hand, benchmarks with loads that are dependent
on each other (i.e., indirection chains such as x[y[z]]) cannot
exploit increased MLP, as their accesses have to be serialized.
Such a dependence chain may also arise if a long-latency load
feeds the branch condition, since any (missing) load after the branch
cannot be executed until the branch target is known (C-shadows).
One memory-bound benchmark [15] that cannot profit from the
M-shadow removal is astar. Astar's hot regions include tight basic
blocks with nested branches and with interleaved loads and stores.
Removing just the M-shadows is therefore not enough to achieve
higher levels of MLP.
The Effect of Removing E- and C-shadows on Performance. DoM+EC-Addr
and DoM+EC-All explore the effect of our instruction reordering
technique on top of DoM. Since the M-shadows are not lifted
for these two versions, all loads are still serialized and no MLP
can be exploited. While the reordering alone does improve performance
for a few benchmarks (e.g., 4% improvement on sjeng-std
with DoM+EC-Addr, and 4% improvement with DoM+EC-All on
bzip2-decompress), it also introduces overhead for others and
cancels out the benefit (e.g., 13% decrease for DoM+EC-All on
milc-path). On average, neither version benefits on its own, since
they are designed to increase the degree of MLP, given that MLP
can be exploited (which it cannot be while M-shadows are in place).
Although the reordering is intended to be combined with the
coherence protocol, there are some cases in which reordering alone has a
positive impact on performance. Our reordering changes the
original code in two ways. First, we try to start all independent
chains as early as possible, and second, we schedule independent
instructions of different chains back-to-back. Shadows restrict the
out-of-order processor's ability to reorder instructions, so it can no
longer freely choose the instructions to execute. As a result, it relies
more on the schedule determined by the compiler than its unsafe
baseline does, similar to smaller processors that do not have the ability
to look far ahead into the code. By splitting the dependence chains and
scheduling independent instructions in-between, dependencies are
more likely to already be resolved by the time their consumers are
considered for execution.
Reordering all instructions, however, comes at a risk. DoM+EC-All
may introduce overhead by keeping many live values around,
which may impact performance negatively. This can be the case if a
basic block is large and contains many independent instructions
that can be overlapped. This is what happens for lbm: lbm's hot
region contains a for loop with a big basic block with many independent
instructions that are grouped into a bucket and scheduled
together. Naturally, this leads to increased register pressure: the
assembly file for DoM counts 26 spills, the one for DoM+EC-All 58
spills. As a result, there is an increased number of instructions
required for spilling and reloading (apart from increasing the number
of instructions, this also leads to an increased number of shadows).
Figure 9 plots the total number of committed instructions
normalized to DoM for each benchmark. For most benchmarks the
number of instructions is roughly the same, but lbm shows a significant
increase in instructions for the two versions DoM+EC-All and
DoM+M+EC-All. This increase is reflected in the decreased performance
for lbm (5% performance degradation for DoM+EC-All,
and 7% for DoM+M+EC-All, respectively).
Putting everything together: The Effect of Removing M-, E-, and
C-shadows on Performance. DoM+M+EC-Addr and DoM+M+EC-All
combine software reordering to tackle E- and C-shadows with load
reordering to eliminate M-shadows, and improve on DoM on average
by 8% and 9%, respectively. Most of the benefit comes from eliminating
M-shadows; combining the load reordering with software reordering
improves performance for a few individual benchmarks (the highest
are libquantum-toffoli with 10% and namd with 18%). Where does
the benefit from software reordering come from? The benefits are
achieved when reordering all instructions within the block (i.e.,
when using DoM+M+EC-All). As mentioned previously, the approach
of grouping independent instructions and scheduling them as
early as possible may allow enough delay for the branch- and
memory-operation-feeding instructions to finish just in time. This
would make it unnecessary to cast any shadows in the first place, or
at least shorten the duration in which the operation is casting a
shadow. On top of that, we may even further increase MLP by grouping
independent loads and scheduling them together. Looking at
the hot regions within namd, we can identify basic blocks that have
many groups of independent loads (that were not grouped before),
which is potentially the reason for the performance improvement.
However, for the majority of benchmarks the reordering does not
help much. The reason is that we limit our reordering to the
bounds of a basic block. Often basic blocks consist of only a
few instructions, or instructions that cannot be moved due to
existing dependencies within the block. In these cases, our reordering
Figure 9: Normalized number of instructions committed for Delay-on-Miss
with our extensions (DoM+M, DoM+EC-All, DoM+EC-Addr, and their
combinations DoM+M+EC-Addr, DoM+M+EC-All). All numbers are normalized
to the unmodified Delay-on-Miss (DoM).
cannot properly address the early removal of C- and E-shadows
and we rely completely on the M-shadow removal.
While DoM+M+EC-Addr was the intuitive solution to eliminate
E- and C-shadows and to increase MLP, we find that DoM+M+EC-All
performs better overall. The drawbacks of DoM+M+EC-All are
essentially the same as for DoM+EC-All, as they both use
the exact same binary, but with a different coherence protocol. As
such, DoM+M+EC-All suffers from increased register pressure if
too many independent chains of instructions exist, which will all be
scheduled right from the start.
Overall, the best version (if one were to select one) is DoM+M+EC-All,
which combines load reordering with instruction scheduling
targeting all instructions to exploit MLP. On average, it improves on
DoM by 9%. The unmodified out-of-order processor is better than
DoM by 19%; thus, our techniques close the performance gap between
DoM and the unsafe out-of-order by 53%.
4.3 More Data to Understand the Performance Benefit
In the previous sections we discussed how our techniques to remove
M-, E-, and C-shadows can be beneficial for MLP and thus for
performance. In this section we want to show more data to support
our previous numbers, and to better understand where the benefit
comes from.
Figure 10 plots the average shadow duration measured in cycles
for all versions, with DoM being the baseline. The graph clearly shows
that DoM+M, DoM+M+EC-Addr, and DoM+M+EC-All reduce the overall
duration compared to DoM (32 cycles for DoM, 14 for
DoM+M, 13 for DoM+M+EC-Addr, and 12 for DoM+M+EC-All, on
average). With shorter shadows, more instructions can be issued at
a time (including loads): Figure 11 shows the average number of
instructions that are issued per cycle, for the unsafe, unmodified
out-of-order (red) and all evaluated versions (colors consistent with
previous plots). A high average number of issued instructions per
cycle translates to higher performance, as can be seen by comparing
Figure 11 and Figure 8. On average, DoM issues 0.98 instructions
per cycle. For DoM+M this number is 1.06, for DoM+M+EC-Addr
1.07, and for DoM+M+EC-All 1.09.
Interestingly, for a few benchmarks, such as libquantum-toffoli
and namd, DoM+EC-All and DoM+M+EC-All issue more instructions
per cycle than the baseline (unsafe out-of-order processor),
and require fewer cycles to finish the regions of interest (see
Figure 8). One reason may be a fortunate combination of reordering
and the remaining shadows preventing misspeculation penalties.
Shadows prevent the out-of-order processor from speculating, and
therefore also from misspeculating. Libquantum-toffoli is known to be
a memory-bound benchmark [15], such that the unsafe processor
often needs to speculate past loads. As misspeculation imposes a
significant overhead when it happens, preventing it may be the
better choice. In combination with reordering under these shadows,
the processor may (instead of wrongly speculating and squashing)
execute useful instructions that are known to be safe.
5 RELATED WORK
Side-channel Attacks. This work focuses on speculative side-channel
attacks, which were first introduced in early 2018 with
the announcement of Meltdown [21] and Spectre [17]. Since these
two original attacks, numerous variants that exploit different parts
of the system have been introduced (e.g., [3, 5, 7, 18, 31, 33, 38]), but
they all share the same two parts: misdirecting execution to
speculatively bypass software and/or hardware checks to gain access
to secret data, and then leaking that data through a side-channel.
The security solutions that we are targeting with this work, such
as Delay-on-Miss [30] or InvisiSpec [39] (the future-proof version),
are not concerned with how the execution is misdirected; instead
they focus on preventing the leakage of information through
side-channels during speculative execution. Because of this, these
security mechanisms are agnostic to the specifics of the attack and
instead try to prevent speculative state from being produced and/or
leaked. The solutions we propose are not specific to certain attacks,
and instead consider information leakage from speculative
execution as a general problem.
Software Mitigations. Software mitigations include speculation
barriers and conditional select/move instructions [2, 13]. Barriers
prevent speculation altogether and impose a significant restriction
on performance. While the compiler may analyze code at risk, static
analysis identifying vulnerable code is not complete. Retpoline [10]
Figure 10: Average shadow duration in cycles for our extensions (DoM+M,
DoM+EC-Addr, DoM+EC-All, DoM+M+EC-Addr, and DoM+M+EC-All), with DoM
being the baseline.
Figure 11: Average number of instructions issued per cycle for the
unmodified out-of-order processor (baseline), Delay-on-Miss (DoM), and
Delay-on-Miss with our extensions (DoM+M, DoM+EC-All, DoM+EC-Addr, and
their combinations DoM+M+EC-Addr, DoM+M+EC-All).
("return trampoline") prevents speculation on indirect branches by
replacing the indirect jump with a call/return combination that traps
speculative execution in an infinite loop. Execution only exits
the loop once the branch target is known. Attacks targeting
conditional branches may be circumvented by introducing a poison
value that poisons loaded values on misspeculated paths [24]. A
similar approach is taken by LLVM's speculative load hardening [22],
which zeroes out pointers before loading them if they are mispredicted.
KAISER [11] protects against Meltdown by enforcing strict
user and kernel space isolation but is ineffective against Spectre.
Other software-based mitigations [8, 20, 32] propose annotation-based
mechanisms for protecting secret data in an effort to reduce
the overhead, but require additional hardware, compiler, and OS
support.
Invisible Speculation in Hardware. There are three main approaches
to preventing speculative execution from causing
measurable side-effects in the system:
(1) Hiding the side-effects of speculative execution until
speculation is resolved. This approach is taken by solutions
such as SafeSpec [16], InvisiSpec [39], Ghost Loads [29],
and MuonTrap [1]. They hide the side-effects
of transient instructions in specially designed buffers that
keep them hidden until the speculation is resolved and the
side-effects can be made visible. Since these approaches have
to wait before they can make the side-effects visible, they
incur a performance cost relative to how long the side-effects
need to be hidden. Our work can help all of these solutions
by reducing, and sometimes eliminating, the delay between
performing an instruction and when its side-effects can be made
visible to the system.
(2) Delaying speculative execution until speculation can
be resolved. Solutions such as Delay-on-Miss [30], Conditional
Speculation [20], SpectreGuard [8], NDA [37], and
Speculative Taint Tracking (STT) [40, 41] selectively delay
instructions when they might be used to leak information.
Some, such as Conditional Speculation and SpectreGuard,
only try to protect data marked by the user as sensitive, while
others, such as Delay-on-Miss, work on all data. NDA and
STT focus on preventing the propagation of unsafe values at
their source, based on the observation that a successful
speculative side-channel attack consists of two dependent parts: (i)
an illegal access (i.e., a speculative load) and (ii) one or more
instructions that depend on the illegal access and leak the
secret. Instead of waking up instructions in the instruction
queue as soon as their operands are ready, NDA wakes up
instructions as soon as they are safe. This way, NDA prevents
secrets from propagating. Similarly, STT taints access
instructions (instructions that may access secrets, i.e., loads)
and untaints them as soon as they are considered safe (i.e., when
all their operands are untainted). While the execution of load
instructions is allowed, the execution of their dependents
is delayed. In comparison to Delay-on-Miss, NDA and STT
therefore only delay the transmit instructions.
The common theme in all of them is that some speculative
instructions are considered unsafe under specific conditions
and need to be delayed until the speculation has
been resolved. Our work can help reduce the performance
overhead caused by these delays by reducing, and sometimes
completely eliminating, the duration during which instructions
are speculative.
(3) Undoing the side-effects of speculative execution in
the case of a misspeculation. CleanupSpec [28] takes a
different approach from the previous solutions by permitting
speculative execution to proceed unhindered and undoing
any side-effects in the event of a misspeculation. The main
cost comes from having to undo the side-effects after a
misspeculation. Our work focuses on detecting correct speculation
early, so it would not benefit CleanupSpec significantly.
Instead, a similar solution would have to focus on detecting
misspeculation early, to reduce the undoing cost. However,
such a solution is outside the scope of this work and is left
as future work.
Other Designs. Other approaches hoist branch conditions to
separate loops to avoid branch prediction (and thus the need for
C-shadows) [34]. In such designs, the splitting of condition and branch
usually does not happen within a basic block but spans a bigger code
range, since they aim at reordering conditions that are originally not
within the processor's view at a given point, i.e., the instruction window.
Similar to non-speculative load-load reordering, which modifies
the coherence protocol to let reordered loads appear serialized and
thus avoid expensive squashes, OmniOrder [26] achieves efficient
execution of atomic blocks in a directory-based coherence environment
by letting the atomic blocks appear serialized. The main
idea behind it is to keep speculative updates in a per-processor
buffer and to leave the basic coherence protocol unmodified. The
history of speculative updates and their origin is moved along
with each coherence transaction, and the receiving processor becomes
responsible for merging or squashing the speculative data
whenever a transaction is committed or squashed.
6 CONCLUSION
With the discovery of speculative side-channel attacks, speculative
execution is no longer considered to be safe. To mitigate the new
vulnerability, many hardware solutions choose to either delay or
hide speculative accesses to memory until they are considered
safe. While sensitive data is thus safe from being leaked, this approach
trades performance for security.
In this work, we take a look at hardware defenses that restrict
the execution of loads and their dependents and only
reveal their side-effects once they are deemed safe. We
analyze the conditions that need to be met for an unsafe load to
become safe, and observe that through instruction reordering we
can influence and shorten the period of time in which a
load is considered unsafe to execute. In combination with a
coherence protocol that enables safe load reordering even under
consistency models that require memory ordering, we unlock the
potential for memory-level parallelism and thus for performance.
We introduce and evaluate our extension on top of a state-of-the-art
hardware defense mechanism, and show that we can improve
its performance by 9% on average, and thus reduce the overall
performance gap to the unsafe out-of-order processor by 53% (on
average).
ACKNOWLEDGMENTS
This work was partially funded by Vetenskapsrådet projects 2015-05159, 2016-05086, and 2018-05254, by the European joint Effort toward a Highly Productive Programming Environment for Heterogeneous Exascale Computing (EPEEC) (grant No. 801051), and by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 819134). The computations were performed on resources provided by SNIC through the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under Project SNIC 2019-3-227.
REFERENCES
[1] Sam Ainsworth and Timothy M. Jones. 2020. MuonTrap: Preventing Cross-Domain Spectre-Like Attacks by Capturing Speculative State. https://doi.org/10.1109/ISCA45697.2020.00022
[2] ARM. [n.d.]. Cache Speculation Side-channels. Online https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability; accessed 27-October-2019.
[3] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. 2019. SMoTherSpectre: exploiting speculative execution through port contention. arXiv:1903.01843 [cs] (March 2019). http://arxiv.org/abs/1903.01843
[4] Nathan L. Binkert, Bradford M. Beckmann, Gabriel Black, Steven K. Reinhardt, Ali G. Saidi, Arkaprava Basu, Joel Hestness, Derek Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib Bin Altaf, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. SIGARCH Computer Architecture News 39, 2 (2011), 1–7.
[5] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. 2018. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. 991–1008. https://www.usenix.org/conference/usenixsecurity18/presentation/bulck
[6] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. 2019. A Systematic Evaluation of Transient Execution Attacks and Defenses. In 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, Santa Clara, CA, 249–266. https://www.usenix.org/conference/usenixsecurity19/presentation/canella
[7] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. 2018. SgxPectre Attacks: Stealing Intel Secrets from SGX Enclaves via Speculative Execution. arXiv:1802.09085 [cs] (Feb. 2018). http://arxiv.org/abs/1802.09085
[8] Jacob Fustos, Farzad Farshchi, and Heechul Yun. 2019. SpectreGuard: An Efficient Data-centric Defense Mechanism against Spectre Attacks. In Proceedings of the 56th Annual Design Automation Conference (DAC ’19). ACM Press, Las Vegas, NV, USA, 1–6. https://doi.org/10.1145/3316781.3317914
[9] Qian Ge, Yuval Yarom, David Cock, and Gernot Heiser. 2018. A survey of microarchitectural timing attacks and countermeasures on contemporary hardware. Journal of Cryptographic Engineering 8, 1 (April 2018), 1–27. https://doi.org/10.1007/s13389-016-0141-6
[10] Google. [n.d.]. Retpoline: a software construct for preventing branch-target-injection. Online https://support.google.com/faqs/answer/7625886; accessed 27-October-2019.
[11] Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard. 2017. KASLR is Dead: Long Live KASLR. In ESSoS (Lecture Notes in Computer Science, Vol. 10379). Springer, 161–176.
[12] John L. Henning. 2006. SPEC CPU2006 benchmark descriptions. Computer Architecture News 34, 4 (2006), 1–17.
[13] Intel. [n.d.]. Intel Analysis of Speculative Execution Side Channels. Online https://www.intel.com/content/www/us/en/architecture-and-technology/intel-analysis-of-speculative-execution-side-channels-paper.html; accessed 27-October-2019.
[14] Gorka Irazoqui, Thomas Eisenbarth, and Berk Sunar. 2016. Cross Processor Cache Attacks. In AsiaCCS. ACM, 353–364.
[15] Aamer Jaleel. 2010. Memory Characterization of Workloads Using Instrumentation-Driven Simulation. Online; accessed 06-January-2020. Web Copy: http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf.
[16] K. N. Khasawneh, E. M. Koruyeh, C. Song, D. Evtyushkin, D. Ponomarev, and N. Abu-Ghazaleh. 2019. SafeSpec: Banishing the Spectre of a Meltdown with Leakage-Free Speculation. In 2019 56th ACM/IEEE Design Automation Conference (DAC). 1–6.
[17] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. 2019. Spectre attacks: Exploiting speculative execution. 19–37. https://doi.org/10.1109/SP.2019.00002
[18] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. 2018. Spectre Returns! Speculation Attacks using the Return Stack Buffer. https://www.usenix.org/conference/woot18/presentation/koruyeh
[19] Chris Lattner and Vikram S. Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO. IEEE Computer Society, 75–88.
[20] Peinan Li, Lutan Zhao, Rui Hou, Lixin Zhang, and Dan Meng. 2019. Conditional Speculation: An Effective Approach to Safeguard Out-of-Order Execution Against Spectre Attacks. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, USA, 264–276. https://doi.org/10.1109/HPCA.2019.00043
[21] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. 2018. Meltdown. arXiv:1801.01207 http://arxiv.org/abs/1801.01207
[22] LLVM. [n.d.]. Speculative Load Hardening. Online https://llvm.org/docs/SpeculativeLoadHardening.html; accessed 16-January-2020.
[23] Yangdi Lyu and Prabhat Mishra. 2018. A Survey of Side-Channel Attacks on Caches and Countermeasures. Journal of Hardware and Systems Security 2, 1 (March 2018), 33–50. https://doi.org/10.1007/s41635-017-0025-y
[24] Ross McIlroy, Jaroslav Sevcík, Tobias Tebbi, Ben L. Titzer, and Toon Verwaest. 2019. Spectre is here to stay: An analysis of side-channels and speculative execution. CoRR abs/1902.05178 (2019).
[25] Aashish Phansalkar, Ajay Joshi, and Lizy Kurian John. 2007. Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite. In ISCA. ACM, 412–423.
[26] Xuehai Qian, Benjamin Sahelices, and Josep Torrellas. 2014. OmniOrder: Directory-based conflict serialization of transactions. In ISCA. IEEE Computer Society, 421–432.
[27] Alberto Ros, Trevor E. Carlson, Mehdi Alipour, and Stefanos Kaxiras. 2017. Non-Speculative Load-Load Reordering in TSO. In ISCA. ACM, 187–200.
[28] Gururaj Saileshwar and Moinuddin K. Qureshi. 2019. CleanupSpec: An "Undo" Approach to Safe Speculation. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’52). ACM, New York, NY, USA, 73–86. https://doi.org/10.1145/3352460.3358314
[29] Christos Sakalis, Mehdi Alipour, Alberto Ros, Alexandra Jimborean, Stefanos Kaxiras, and Magnus Själander. 2019. Ghost Loads: What is the Cost of Invisible Speculation? 153–163. https://doi.org/10.1145/3310273.3321558
[30] Christos Sakalis, Stefanos Kaxiras, Alberto Ros, Alexandra Jimborean, and Magnus Själander. 2019. Efficient invisible speculative execution through selective delay and value prediction. In ISCA. ACM, 723–735.
[31] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. [n.d.]. ZombieLoad: Cross-Privilege-Boundary Data Sampling. ([n. d.]), 15.
[32] Michael Schwarz, Robert Schilling, Florian Kargl, Moritz Lipp, Claudio Canella, and Daniel Gruss. 2019. ConTExT: Leakage-Free Transient Execution. arXiv:1905.09100 [cs] (May 2019). http://arxiv.org/abs/1905.09100
[33] Michael Schwarz, Martin Schwarzl, Moritz Lipp, and Daniel Gruss. 2018. NetSpectre: Read Arbitrary Memory over Network. (July 2018). https://arxiv.org/abs/1807.10535
[34] Rami Sheikh, James Tuck, and Eric Rotenberg. 2015. Control-Flow Decoupling: An Approach for Timely, Non-Speculative Branching. IEEE Trans. Computers 64, 8 (2015), 2182–2203.
[35] Dimitrios Skarlatos, Mengjia Yan, Bhargava Gopireddy, Read Sprabery, Josep Torrellas, and Christopher W. Fletcher. 2019. MicroScope: Enabling Microarchitectural Replay Attacks. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA ’19). ACM, New York, NY, USA, 318–331. https://doi.org/10.1145/3307650.3322228
[36] Ofir Weisse, Ian Neal, Kevin Loughlin, Thomas F. Wenisch, and Baris Kasikci. 2019. NDA: Preventing Speculative Execution Attacks at Their Source. In MICRO. ACM, 572–586.
[37] Ofir Weisse, Ian Neal, Kevin Loughlin, Thomas F. Wenisch, and Baris Kasikci. 2019. NDA: Preventing Speculative Execution Attacks at Their Source. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’52). ACM, New York, NY, USA, 572–586. https://doi.org/10.1145/3352460.3358306
[38] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. 2018. Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution. (Aug. 2018). https://lirias.kuleuven.be/2089352
[39] Mengjia Yan, Jiho Choi, Dimitrios Skarlatos, Adam Morrison, Christopher W. Fletcher, and Josep Torrellas. 2018. InvisiSpec: Making Speculative Execution Invisible in the Cache Hierarchy. In MICRO. IEEE Computer Society, 428–441.
[40] Jiyong Yu, Namrata Mantri, Josep Torrellas, Adam Morrison, and Christopher W. Fletcher. 2020. Speculative Data-Oblivious Execution: Mobilizing Safe Prediction For Safe and Efficient Speculative Execution. https://doi.org/10.1109/ISCA45697.2020.00064
[41] Jiyong Yu, Mengjia Yan, Artem Khyzha, Adam Morrison, Josep Torrellas, and Christopher W. Fletcher. 2019. Speculative Taint Tracking (STT): A Comprehensive Protection for Speculatively Accessed Data. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’52). ACM, New York, NY, USA, 954–968. https://doi.org/10.1145/3352460.3358274