Virtualisation For The Masses: Exposing KVM On Android
https://source.android.com/devices/architecture/kernel/generic-kernel-image
Virtualisation on Android today
tl;dr: It’s the Wild West of fragmentation
● Security: Increased TCB and difficulty/cost in providing streamlined updates across devices
● Functionality: Unable to leverage hardware virtualisation capabilities from within Android
The Armv8 exception model on paper
[Figure: Armv8 exception levels. Non-secure side: Apps at EL0, Android Kernel (GKI) at EL1, Hypervisor at EL2. Secure world: Trusted Apps at sEL0. Annotation: “bit on the bus”.]
Non-secure side:
● Android kernel (GKI)
● Vendor modules
● System and libraries
● Apps
● “Android”
Secure world:
● DRM, crypto, ...
● Third party OSes
● Opaque blobs
● Per-device integration
What about that hypervisor?
[Figure: VMM and Apps at EL0, Android Kernel (GKI) at EL1, Hypervisor at EL2.]
● Threat model places the entire host kernel (and VMM via ioctl()s) into the TCB; host has full access to guest memory.
○ This is a bit like “inverse TrustZone”
Big problem!
The threat model of Android is not aligned with the
current design of KVM.
Revisiting nVHE with “Protected KVM”
Android’s security model requires that guest data remains private even if the host kernel has
been compromised. Maybe nVHE isn’t so bad after all...
● Extend world-switch code at EL2 to manage stage-2 page-tables and guest state
● Install a stage-2 translation for the host kernel during boot before loading vendor modules
[Figure: boot sequence. Generic Kernel and Hypervisor, then GKI Modules, then Vendor Modules.]
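A minimal sketch of the idea behind the host stage-2, under loud assumptions: the `stage2` structure, the `page_owner` states and every function name below are hypothetical and are not taken from the kernel. It only models the bookkeeping, in which EL2 records one owner per page and the host loses access to any page it does not own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: EL2 tracks exactly one owner for each physical page. */
enum page_owner { OWNER_HOST, OWNER_GUEST, OWNER_HYP };

#define NUM_PAGES 16

struct stage2 {
    enum page_owner owner[NUM_PAGES];
};

/*
 * Boot-time setup, run before any vendor module is loaded: every page
 * starts out host-owned except the range EL2 keeps for itself.
 */
static void stage2_init(struct stage2 *s2, size_t first_hyp_page,
                        size_t nr_hyp_pages)
{
    /* Caller contract for this sketch: the EL2 range fits in memory. */
    assert(first_hyp_page + nr_hyp_pages <= NUM_PAGES);

    for (size_t i = 0; i < NUM_PAGES; i++)
        s2->owner[i] = OWNER_HOST;
    for (size_t i = 0; i < nr_hyp_pages; i++)
        s2->owner[first_hyp_page + i] = OWNER_HYP;
}

/* Hand a page to a guest; only pages the host currently owns qualify. */
static bool stage2_donate_to_guest(struct stage2 *s2, size_t page)
{
    if (page >= NUM_PAGES || s2->owner[page] != OWNER_HOST)
        return false;
    s2->owner[page] = OWNER_GUEST;
    return true;
}

/* Would the host stage-2 translation permit this access? */
static bool host_may_access(const struct stage2 *s2, size_t page)
{
    return page < NUM_PAGES && s2->owner[page] == OWNER_HOST;
}
```

In this model a donated page is no longer reachable from the host, even if the host kernel is later compromised, because the decision is taken from state that only EL2 can change.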
● Basically just context-switches EL1 and allows host kernel to run functions with elevated privilege
● Tight coupling with host kernel is optimal for KVM’s threat model
Executing at EL2 (< 5.9)
Prior to 5.9, Linux offered the kvm_call_hyp(f, ...) macro to run kernel functions annotated with __hyp_text at EL2.
// C code to run at EL2 (arch/arm64/kvm/hyp/tlb.c)
void __hyp_text __kvm_flush_vm_context(void)
{
    dsb(ishst);
    __tlbi(alle1is);
    if (icache_is_vpipt())
        asm volatile("ic ialluis");
    dsb(ish);
}

// Callsite
kvm_call_hyp(__kvm_flush_vm_context);
The EL2 object in 5.9/5.10
New threat model needs EL2 code to be self-contained & safe against compromised host kernel:
● Embed EL2 payload using separate ELF sections and symbol prefixing (similar to EFI stub)
● Fixed set of hypercalls rather than arbitrary function pointers
● Prior to de-privilege, host sets static keys and applies alternatives (one-way switch)
● Following de-privilege, EL2 object no longer mapped for EL1
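The move from arbitrary function pointers to a fixed set of hypercalls can be sketched as a bounds-checked dispatch table. This is an illustration only: the hypercall IDs, the handler bodies and the `handle_host_hcall` name are hypothetical and do not reflect the kernel's actual list or calling convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hypercall numbers; the real kernel's list differs. */
enum hcall_id { HCALL_FLUSH_VM_CONTEXT, HCALL_GET_VERSION, HCALL_COUNT };

#define HCALL_ERR_UNKNOWN UINT64_MAX

typedef uint64_t (*hcall_fn)(uint64_t arg);

static uint64_t flush_count;

static uint64_t do_flush_vm_context(uint64_t arg)
{
    (void)arg;
    return ++flush_count;   /* stand-in for the real TLB invalidation */
}

static uint64_t do_get_version(uint64_t arg)
{
    (void)arg;
    return 1;
}

/* The table lives in the EL2 object, so the host cannot add entries. */
static const hcall_fn hcall_table[] = {
    [HCALL_FLUSH_VM_CONTEXT] = do_flush_vm_context,
    [HCALL_GET_VERSION]      = do_get_version,
};

static_assert(sizeof(hcall_table) / sizeof(hcall_table[0]) == HCALL_COUNT,
              "one table slot per hypercall ID");

/* EL2 entry point: the host supplies an ID and an argument, never a pointer. */
static uint64_t handle_host_hcall(uint64_t id, uint64_t arg)
{
    if (id >= HCALL_COUNT || hcall_table[id] == NULL)
        return HCALL_ERR_UNKNOWN;
    return hcall_table[id](arg);
}
```

The design point is that an untrusted caller can only select among handlers that EL2 already contains; an out-of-range ID is rejected instead of being treated as an address to jump to.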
Executing at EL2 (5.9/5.10)
Symbol aliases created from “allowlist” of kernel symbols for use at EL2.
// C code to run at EL2 (arch/arm64/kvm/hyp/nvhe/tlb.c)
void __kvm_flush_vm_context(void)
{
    [...]
}

// Callsite
kvm_call_hyp(__kvm_flush_vm_context);
Virtual memory at EL2 (without pKVM)
Today, the host kernel is trusted and therefore in control of the hypervisor virtual memory.
This makes it trivial for a compromised host kernel to bypass the new hypervisor restrictions.
EL2 MM bootstrap (5.11?)
Allowing the host kernel to manipulate these page-tables breaks the revised security model:
● IOMMU support
○ Unfortunate reliance on SoC design and sensible hardware
● VMM will not be able to access guest state (including CPU registers and memory)
○ Negotiate shared memory regions with guest for virtio
○ Q: How is a guest initialised to begin with?
Template bootloader
● Guest must use crypto (e.g. fs-verity) as host can intercept data due to lack of hardware memory encryption
● No shared-memory device?
○ Virtio assumes guest memory is shared with the host
https://www.kernel.org/doc/html/latest/filesystems/fsverity.html
Bounce-buffering via shared windows
“[...] indicates that the device can be used on a platform where device access to data in memory is limited
and/or translated.”
When this feature bit is set, a Linux guest uses the DMA API for virtio allocations.
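The quoted wording matches the virtio specification's description of VIRTIO_F_ACCESS_PLATFORM (feature bit 33). Assuming that is the feature meant here, the decision can be sketched as a test on the negotiated feature bits. The helper names below are hypothetical and are not the Linux driver's own functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit number from the virtio specification; assumed to be the feature quoted. */
#define VIRTIO_F_ACCESS_PLATFORM 33

static_assert(VIRTIO_F_ACCESS_PLATFORM < 64, "feature bit must fit in 64 bits");

/* True if the given feature bit is present in a 64-bit feature word. */
static bool virtio_feature_set(uint64_t features, unsigned int bit)
{
    return bit < 64 && (features & (UINT64_C(1) << bit)) != 0;
}

/* Hypothetical helper: should the guest route virtio buffers via the DMA API? */
static bool guest_uses_dma_api(uint64_t negotiated_features)
{
    return virtio_feature_set(negotiated_features, VIRTIO_F_ACCESS_PLATFORM);
}
```

Routing allocations through the DMA API is what gives the guest a single place to substitute bounce buffers for its private memory.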
● We then just need to allow the host to access the bounce buffer pages
○ Expose SHARE/UNSHARE hypercalls to the guest to update host stage-2
○ Hook the set_memory_{decrypted,encrypted}() API to share/unshare bounce buffer pages
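The bounce-buffer flow above can be sketched as a toy model, assuming a page-granular "shared with host" flag kept at EL2. All names here (`hcall_share`, `hcall_unshare`, `host_read`, `guest_bounce_out`) are hypothetical stand-ins; in a real Linux guest the share and unshare steps would sit behind set_memory_decrypted() and set_memory_encrypted().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ   4096
#define NUM_PAGES 4

/* Toy guest memory plus the per-page sharing state that EL2 would hold. */
static unsigned char guest_mem[NUM_PAGES][PAGE_SZ];
static bool shared_with_host[NUM_PAGES];

/* Stand-ins for the SHARE/UNSHARE hypercalls issued by the guest. */
static bool hcall_share(size_t page)
{
    if (page >= NUM_PAGES)
        return false;
    shared_with_host[page] = true;
    return true;
}

static bool hcall_unshare(size_t page)
{
    if (page >= NUM_PAGES)
        return false;
    shared_with_host[page] = false;
    return true;
}

/* Host-side read: refused unless the guest has shared the page. */
static bool host_read(size_t page, size_t off, void *dst, size_t len)
{
    if (page >= NUM_PAGES || !shared_with_host[page])
        return false;
    if (off > PAGE_SZ || len > PAGE_SZ - off)
        return false;
    memcpy(dst, &guest_mem[page][off], len);
    return true;
}

/* Guest-side transmit: copy a private payload into the shared bounce page. */
static bool guest_bounce_out(size_t bounce_page, const void *src, size_t len)
{
    assert(src != NULL);    /* caller contract for this sketch */
    if (bounce_page >= NUM_PAGES || !shared_with_host[bounce_page])
        return false;
    if (len > PAGE_SZ)
        return false;
    memcpy(guest_mem[bounce_page], src, len);
    return true;
}
```

Only the bounce page is ever visible to the host; the payload's original location stays private, at the cost of one copy per transfer.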
Zero-copy transfers using shared memory
Bounce buffers force the copying of all I/O data through a shared window.
● Complete mm bootstrap
● Stabilise user ABIs
● Settle on solution for zero-copy I/O
● Move more guest state up to EL2
● Memory poisoning
● SMC proxying
● Attestation
● Ballooning
● Integration with rest of Android
● Continue upstreaming...
Questions?
<android-kvm@google.com>