Fosdem 2018
• An Instruction Set Architecture (ISA) describes the contract between hardware and software
• Defines the instructions that all machines implementing the architecture must support
• Load/Store from memory, architectural registers, stack, branches/control flow
• Arithmetic, floating point, vector operations, and various possible extensions
• Defines user (unprivileged, problem state) and supervisor (privileged) execution states
• Exception levels used for software exceptions and hardware interrupts
• Privileged registers used by the Operating System for system management
• Mechanisms for application task context management and switching
• Defines the memory model used by machines compliant with the ISA
• The lowest level targeted by an application programmer or (more often) compiler
• Operating System software makes use of additional privileged set of ISA instructions
• These include instructions to manage application context (registers, MMU state, etc.)
• e.g. on x86 this includes being able to set the CR3 (page table base) control register that hardware
uses to automatically translate virtual addresses into physical memory addresses
• Operating System software is responsible for switching between applications
• Save the process state (including registers), update the control registers
• Operating System software maintains application page tables
• The hardware triggers a “page fault” whenever a virtual address is inaccessible
• This could be because an application has been partially “swapped” (paged) out to disk, is being
demand loaded, or because the application does not have permission to access that address
[Figure: example SoC topology — pairs of cores (C1, C2) sharing per-pair L2 caches, a shared LLC in the middle, and DDR memory controllers (MEM) on each side of the die]
• Programmers often think in terms of “processors” by which they usually mean “cores”
• Some cores are “multi-threaded” (SMT) sharing execution resources between two threads
• Minimal context separation is maintained through some (lightweight) duplication
• Many cores are integrated into today's processor packages (SoCs)
• These are connected using interconnect(ion) networks and cache coherency protocols
• Provides a hardware managed coherent view of system memory shared between cores
• Memory controllers handle load/store of program instructions and data to/from RAM
• Manage scheduling of DDR (or other memory) and sometimes leverage hardware access hints
• Cache hierarchy sits between external (slow) RAM and (much faster) processor cores
• Progressively tighter coupling from LLC (L3) through to L1 running at core speed
[Figure: classic five-stage pipeline — Instruction Fetch (steered by the Branch Predictor), Instruction Decode, Instruction Execute, Memory Access (via the L1 D$), and Writeback to the Register File]
• This is the classical “RISC” pipeline often taught first in computer architecture courses
• Pipelining means instruction processing is split into multiple clock cycles
• Multiple instructions may be at different “stages” in the pipeline simultaneously
• 1. Instructions are fetched from a dedicated L1 Instruction Cache (I$)
• L1 cache automatically fills cache lines from “unified” L2/LLC on demand
• 2. Instructions are then decoded according to the ISA defined set of “encodings”
• e.g. “add r3, r1, r2”
• 3. Instructions are executed by the execution units
• 4. Memory access is performed to/from the dedicated L1 Data Cache (D$)
• 5. The architectural register file is updated
• e.g. r3 becomes the result of r1 + r2
[Figure: five instructions in flight — each starts one cycle after the previous, so the IF/ID/EX/MEM/WB stages of successive instructions overlap and one instruction completes per cycle in steady state]
• An in-order machine can suffer from pipeline stalls when stages are not ready
• The memory access stage may be able to load from the L1 D$ in a single cycle
• But if it is not in the L1 D$ then we insert a pipeline “bubble” while we wait for the data
• This may take many additional cycles while the data is fetched from further away
• Limited capability to hide latency of instructions
• Future instructions may not be dependent upon stalling earlier instructions
• Limited branch prediction depending upon implementation
• Typically squash a few pipeline stages and/or stall for data
[Figure: in an in-order machine, “R3 = R1 + R2” must wait on the two LOADs from the L1 D$ (data dependency), while the later “R1 = 1; R2 = 1; R3 = R1 + R2” sequence has no data dependency on them and could execute earlier]

Out-of-order execution renames architectural registers to physical registers:

Instruction    #  Rename    Renamed form   Deps  Ready
R1 = LOAD A    1  P1 = R1   P1 = LOAD A    —     Y
R2 = LOAD B    2  P2 = R2   P2 = LOAD B    —     Y
R3 = R1 + R2   3  P3 = R3   P3 = P1 + P2   1,2   N
R1 = 1         4  P4 = R1   P4 = 1         —     Y
R2 = 1         5  P5 = R2   P5 = 1         —     Y
R3 = R1 + R2   6  P6 = R3   P6 = P4 + P5   4,5   N
• User applications are known as “processes” (or “tasks”) when they are running
• They run in “userspace”, a less privileged context with many restrictions imposed
• Managed through special hardware interfaces (registers) as well as other structures
• We will look at an example of how “page tables” isolate kernel and userspace shortly
• Applications make “system calls” into the kernel to request services
• For example “open” a file or “read” some bytes from an open file
• Enter the kernel briefly using a hardware provided mechanism (syscall interface)
• A great amount of optimization has gone into making this a lightweight entry/exit
• Special optimizations exist for some frequently used kernel services
• The VDSO (Virtual Dynamic Shared Object) looks like a shared library but is provided by the kernel
• A gettimeofday (GTOD) call typically does not need to enter the kernel at all
• Memory accesses are translated (possibly multiple times) before reaching memory
• Applications use virtual addresses (VAs) that are managed at page-sized granularity
• A VA may be mapped to an intermediate address if a Hypervisor is in use
• Either the Hypervisor or Operating System kernel manages physical translations
• Translations use hardware-assisted page table walkers that traverse page tables
• The Operating System creates and manages the page tables for each application
• Hardware manages TLBs (Translation Lookaside Buffers) filled with recent translations
• The collection of currently valid addresses is known as a (virtual) address space
• On “context switch” from one process to another, page table base pointers are swapped, and
existing TLB entries are invalidated. Cache flushing may be required depending upon the use of
address space IDs (ASIDs, PCIDs, etc.) in the architecture and the Operating System
[Figure: the SoC topology again — pairs of cores sharing L2 caches, a shared LLC, and DDR memory controllers feeding RAM]
• Caches exist because the principle of locality says recently used data is likely to be used again
• Unfortunately we have a choice between “small and fast” and “large and slow”
• Levels of cache provide the best of both, replacement policies handle cache eviction
• Caches are organized into sets where each set can contain multiple cache lines
• A typical cache line is 64 or 128 bytes and represents a block of memory
• A typical memory block will map to a single cache set, but can be in any “way” of that set
• Caches may be direct mapped or (fully) associative depending upon complexity
• Direct mapped allows one memory location to exist only in a specific cache location
• Associative caches allow one memory location to map to one of N cache locations
• “In computer security, a side-channel attack is any attack based on information gained from
the physical implementation of a computer system, rather than weaknesses in the implemented
algorithm itself (e.g. cryptanalysis and software bugs).” – from the Wikipedia definition
• Caches exist fundamentally because they provide faster access to frequently used data
• The closer data is to the compute cores, the less time is required to load it when needed
• This difference in access time for a given address can be measured by software
• Data closer to the cores will take fewer cycles to access
• Data further away from the cores will take more cycles to access
• Consequently it is possible to determine whether a specific address is in the cache
• Calibrate by measuring access time for known cached/not cached data
• Time access to a memory location and compare with calibration
• Some processors provide a means to prefetch data that will be needed soon
• Usually encoded as “hint” or “nop space” instructions that may have no effect
• x86 processors provide several variants of PREFETCH with a temporal hint
• This may result in a prefetched address being allocated into a cache
• Processors will perform page table walks and populate TLBs on prefetch
• This may happen even if the address is not actually fetched into the cache
/* x86 PREFETCH variants are hints only — the CPU may ignore them */
asm volatile ("prefetcht0 (%0)" : : "r" (p));   /* into all cache levels          */
asm volatile ("prefetcht1 (%0)" : : "r" (p));   /* into L2 and closer             */
asm volatile ("prefetcht2 (%0)" : : "r" (p));   /* into L3/L2 (implementation specific) */
asm volatile ("prefetchnta (%0)" : : "r" (p));  /* non-temporal: minimize pollution */
[Figure: a conditional branch — check FLAGS; if True, call take_umbrella(); if False, skip it]
Instruction    #  Rename    Renamed form   Deps  Issued  Speculative
R1 = LOAD A    1  P1 = R1   P1 = LOAD A    —     Y       N
TEST R1        2  —         TEST R1        1     Y       N
IF R1 ZERO {   3  —         IF R1 ZERO {   1     N       N
R1 = 1         4  P2 = R1   P2 = 1         —     Y       Y*
R2 = 1         5  P3 = R2   P3 = 1         —     Y       Y*
R3 = R1 + R2   6  P4 = R3   P4 = P2 + P3   4,5   Y       Y*

(Y* = executed speculatively beyond the unresolved branch at instruction 3)
• A conditional branch will be performed based upon the state of the condition flags
• Condition flags are commonly implemented in modern ISAs and set by certain instructions
• Some ISAs are optimized to set the condition flags only in specific instruction variants
• Most loops are implemented as a conditional backward jump following a test:
[Figure: Process A and Process B each have a branch at virtual address 0x5000 (“BRANCH A”, “BRANCH B”); both index the same branch predictor entry, which holds the pattern history T,T,N,N,T,T,N,N]
• Branch behavior is rarely random and can usually be predicted with high accuracy
• Branch predictor is first “trained” using historical direction to predict future
• Over 99% accuracy is possible depending upon the branch predictor sophistication
• Branches are identified based upon the (virtual) address of the branch instruction
• Index into branch prediction structure containing pattern history e.g. T,T,N,N,T,T,N,N
• These may be tagged during instruction fetch/decode using extra bits in the I$
• Most contemporary high performance branch predictors combine local/global history
• Recognizing that branches are rarely independent and usually have some correlation
• A Global History Register is combined with saturating counters for each history entry
• May also hash the GHR with the address of the branch instruction (e.g. the “Gshare” predictor)
$ grep . /sys/devices/system/cpu/vulnerabilities/*
/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
/sys/devices/system/cpu/vulnerabilities/spectre_v1:Vulnerable
/sys/devices/system/cpu/vulnerabilities/spectre_v2:Vulnerable: Minimal generic ASM retpoline
• Implementations of Out-of-Order execution that strictly follow the original Tomasulo algorithm
handle exceptions arising from speculatively executed instructions at instruction retirement
• Speculated instructions do not trigger (synchronous) exceptions in response to execution
• Loads that are not permitted will not be reported until they are no longer speculative
• At that time, the application will likely receive a “segmentation fault” or other error
• Some implementations may perform load permission checks in parallel with the load
• This improves performance and the rationale is that the load is only speculative
• A race condition may thus exist allowing access to privileged data
• A malicious attacker arranges for exploit code similar to the following to speculatively execute:
if (spec_cond) {                            /* mispredicted: body runs speculatively */
unsigned char value = *(unsigned char *)ptr;             /* read of privileged data  */
unsigned long index2 = (((value>>bit)&1)*0x100)+0x200;   /* 0x200 or 0x300 per bit   */
maccess(&data[index2]);                     /* value-dependent cache footprint       */
}
• “data” is a user-controlled array to which the attacker has access; “ptr” points to privileged data
• Access to “data” element 0x100 pulls the corresponding entry into the cache
[Figure: char data[] with lines at offsets 0x000–0x300; the line at 0x100 is now resident in the cache]
• Access to “data” element 0x300 pulls the corresponding entry into the cache
[Figure: char data[] again; this time the line at 0x300 is resident in the cache]
• We use the cache as a side channel to determine which element of “data” is in the cache
• Access both elements and time the difference in access (we previously flushed them)
time = rdtsc();
maccess(&data[0x300]);
delta3 = rdtsc() - time;   /* execution time of the access is proportional
                              to whether the line is in the cache(s) */
time = rdtsc();
maccess(&data[0x200]);
delta2 = rdtsc() - time;
• When the right conditions exist, this branch of code will run speculatively
• The privilege check for “value” will fail, but only result in an entry tag in the ROB
• The access will occur although “value” will be discarded when speculation is undone
• The offset accessed in the “data” user array is dependent upon the value of privileged data
• We can use this as a 1-bit signal selecting between two possible entries of the user data array
• Cache side channel timing analysis is used to measure which “data” location was accessed
• Time access to “data” locations 0x200 and 0x300 to infer value of desired bit
• Access is done in reverse in my code to account for cache line prefetcher
• Linux calls this page table separation “PTI”: Page Table Isolation
• Requires an expensive write to core control registers on every entry/exit from the OS kernel
• e.g. TTBR write on impacted ARMv8, CR3 on impacted x86 processors
• Only enabled by default on known-vulnerable microprocessors
• An enumeration is defined to discover future non-impacted silicon
• Address Space IDentifiers (ASIDs) can significantly improve performance
• ASIDs on ARMv8, PCIDs (Process Context IDs) on x86 processors
• TLB entries are tagged with address space so a full invalidation isn't required
• Significant performance delta between older (pre-2010 x86) cores and newer ones
• The code following the bounds check is known as a “gadget” (see ROP attacks)
• Existing code contained within a different victim context (e.g. Operating System/Hypervisor)
• Code following the untrusted_offset bounds check may be executed speculatively
• Resulting in the speculative loading of trusted data into a local variable
• This trusted data is used to calculate an offset into another structure
• Relative offset of other_data accessed can be used to infer trusted_value
• L1D$ cache load will occur for other_data at an offset correlated with trusted_value
• Measure which cache location was loaded speculatively to infer the secret value
• Modern microprocessors are extremely complex machines requiring huge capital investment
• A high performance core might require a 300+ person team, and 4 years of engineering effort
• Consequently the ability to handle potential issues in the field is extremely compelling
• Modern cores provide thousands of hidden tunable knobs (chicken bits) that allow a design
team to “chicken out” and disable certain features that aren't working in whole or in part
• A high performance core might have as many as 10,000 different chicken bits available
• A chicken bit might be programmed in firmware prior to system boot
• e.g. “disable all indirect branch prediction when in privileged state” (if this is possible)
• Or it might be exposed to the Operating System to poke it as needed
call set_up_target;
capture_spec:              # harmless infinite loop for the CPU to speculate :)
pause;
jmp capture_spec;
set_up_target:
mov %r11, (%rsp);          # modify the return stack to force a “return” to the target
ret;
• Variations of these microarchitecture attacks are likely to be found for many years
• An example is known as “variant 3a”. Some microprocessors will allow speculative read of
privileged system registers to which an application should not normally have access
• Can be used to determine the address of key structures such as page table base registers
linkedin.com/company/red-hat twitter.com/RedHatNews
youtube.com/user/RedHatVideos
Exploiting modern microarchitectures:
Meltdown, Spectre, and other attacks
Jon Masters, Computer Architect, Red Hat, Inc.
jcm@redhat.com | @jonmasters