Search | arXiv e-print repository

ECI: a Customizable Cache Coherency Stack for Hybrid FPGA-CPU Architectures

Authors: Abishek Ramdas, Michael Giardino, Runbin Shi, Adam Turowski, David Cock, Gustavo Alonso, Timothy Roscoe

Abstract: Unlike other accelerators, FPGAs are capable of supporting cache coherency, thereby turning them into a more powerful architectural option than just a peripheral accelerator. However, most existing deployments of FPGAs are either non-cache coherent or support only an asymmetric design where cache coherency is controlled from the CPU. Taking advantage of a recently released two socket CPU-FPGA arch… ▽ More Unlike other accelerators, FPGAs are capable of supporting cache coherency, thereby turning them into a more powerful architectural option than just a peripheral accelerator. However, most existing deployments of FPGAs are either non-cache coherent or support only an asymmetric design where cache coherency is controlled from the CPU. Taking advantage of a recently released two socket CPU-FPGA architecture, in this paper we describe ECI, a flexible implementation of cache coherency on the FPGA capable of supporting both symmetric and asymmetric protocols. ECI is open and customizable, given applications the opportunity to fully interact with the cache coherency protocol, thereby opening up many interesting system design and research opportunities not available in existing designs. Through extensive microbenchmarks we show that ECI exhibits highly competitive performance and discuss in detail one use-case illustrating the benefits of having an open cache coherency stack on the FPGA. △ Less

Submitted 15 August, 2022; originally announced August 2022.

arXiv:2009.02737 [pdf, other]

Secure Memory Management on Modern Hardware

Authors: Reto Achermann, Nora Hossle, Lukas Humbel, Daniel Schwyn, David Cock, Timothy Roscoe

Abstract: Almost all modern hardware, from phone SoCs to high-end servers with accelerators, contain memory translation and protection hardware like IOMMUs, firewalls, and lookup tables which make it impossible to reason about, and enforce protection and isolation based solely on the processor's MMUs. This has led to numerous bugs and security vulnerabilities in today's system software. In this paper we r… ▽ More Almost all modern hardware, from phone SoCs to high-end servers with accelerators, contain memory translation and protection hardware like IOMMUs, firewalls, and lookup tables which make it impossible to reason about, and enforce protection and isolation based solely on the processor's MMUs. This has led to numerous bugs and security vulnerabilities in today's system software. In this paper we regain the ability to reason about and enforce access control using the proven concept of a reference monitor mediating accesses to memory resources. We present a fine-grained, realistic memory protection model that makes this traditional concept applicable today, and bring system software in line with the complexity of modern, heterogeneous hardware. Our design is applicable to any operating system, regardless of architecture. We show that it not only enforces the integrity properties of a system, but does so with no inherent performance overhead and it is even amenable to automation through code generation from trusted hardware specifications. △ Less

Submitted 6 September, 2020; originally announced September 2020.

ACM Class: D.4; D.4.6

arXiv:1911.08773 [pdf, other]

CleanQ: a lightweight, uniform, formally specified interface for intra-machine data transfer

Authors: Roni Haecki, Lukas Humbel, Reto Achermann, David Cock, Daniel Schwyn, Timothy Roscoe

Abstract: We present CleanQ, a high-performance operating-system interface for descriptor-based data transfer with rigorous formal semantics, based on a simple, formally-verified notion of ownership transfer, with a fast reference implementation. CleanQ aims to replace the current proliferation of similar, but subtly diverse, and loosely specified, descriptor-based interfaces in OS kernels and device driver… ▽ More We present CleanQ, a high-performance operating-system interface for descriptor-based data transfer with rigorous formal semantics, based on a simple, formally-verified notion of ownership transfer, with a fast reference implementation. CleanQ aims to replace the current proliferation of similar, but subtly diverse, and loosely specified, descriptor-based interfaces in OS kernels and device drivers. CleanQ has strict semantics that not only clarify both the implementation of the interface for different hardware devices and software usecases, but also enable composition of modules as in more heavyweight frameworks like Unix streams. We motivate CleanQ by showing that loose specifications derived from implementation lead to security and correctness bugs in production systems that a clean, formal, and easilyunderstandable abstraction helps eliminate. We further demonstrate by experiment that there is negligible performance cost for a clean design: we show overheads in the tens of cycles for operations, and comparable end-to-end performance to the highly-tuned Virtio and DPDK implementations on Linux. △ Less

Submitted 20 November, 2019; originally announced November 2019.

arXiv:1911.08367 [pdf, other]

Cichlid: Explicit physical memory management for large machines

Authors: Simon Gerber, Gerd Zellweger, Reto Achermann, Moritz Hoffmann, Kornilios Kourtis, Timothy Roscoe, Dejan Milojicic

Abstract: In this paper, we rethink how an OS supports virtual memory. Classical VM is an opaque abstraction of RAM, backed by demand paging. However, most systems today (from phones to data-centers) do not page, and indeed may require the performance benefits of non-paged physical memory, precise NUMA allocation, etc. Moreover, MMU hardware is now useful for other purposes, such as detecting page access or… ▽ More In this paper, we rethink how an OS supports virtual memory. Classical VM is an opaque abstraction of RAM, backed by demand paging. However, most systems today (from phones to data-centers) do not page, and indeed may require the performance benefits of non-paged physical memory, precise NUMA allocation, etc. Moreover, MMU hardware is now useful for other purposes, such as detecting page access or providing large page translation. Accordingly, the venerable VM abstraction in OSes like Windows and Linux has acquired a plethora of extra APIs to poke at the policy behind the illusion of a virtual address space. Instead, we present Cichlid, a memory system which inverts this model. Applications explicitly manage their physical RAM of different types, and directly (though safely) program the translation hardware. Cichlid is implemented in Barrelfish, requires no virtualization support, and outperforms VMM-based approaches for all but the smallest working sets. We show that Cichlid enables use-cases for virtual memory not possible in Linux today, and other use-cases are simple to program and significantly faster. △ Less

Submitted 19 November, 2019; originally announced November 2019.

arXiv:1910.05398 [pdf, other]

Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines

Authors: Reto Achermann, Ashish Panwar, Abhishek Bhattacharjee, Timothy Roscoe, Jayneel Gandhi

Abstract: Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case… ▽ More Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case for explicit page-table allocation policies and show that page-table placement is becoming crucial to overall performance. We propose Mitosis to mitigate NUMA effects on page-table walks by transparently replicating and migrating page-tables across sockets without application changes. This reduces the frequency of accesses to remote NUMA nodes when performing page-table walks. Mitosis uses two components: (i) a mechanism to enable efficient page-table replication and migration; and (ii) policies for processes to efficiently manage and control page-table replication and migration. We implement Mitosis in Linux and evaluate its benefits on real hardware. Mitosis improves performance for large-scale multi-socket workloads by up to 1.34x by replicating page-tables across sockets. Moreover, it improves performance by up to 3.24x in cases when the OS migrates a process across sockets by enabling cross-socket page-table migration. △ Less

Submitted 8 November, 2019; v1 submitted 11 October, 2019; originally announced October 2019.

arXiv:1908.08707 [pdf, other]

A Least-Privilege Memory Protection Model for Modern Hardware

Authors: Reto Achermann, Nora Hossle, Lukas Humbel, Daniel Schwyn, David Cock, Timothy Roscoe

Abstract: We present a new least-privilege-based model of addressing on which to base memory management functionality in an OS for modern computers like phones or server-based accelerators. Existing software assumptions do not account for heterogeneous cores with different views of the address space, leading to the related problems of numerous security bugs in memory management code (for example programming… ▽ More We present a new least-privilege-based model of addressing on which to base memory management functionality in an OS for modern computers like phones or server-based accelerators. Existing software assumptions do not account for heterogeneous cores with different views of the address space, leading to the related problems of numerous security bugs in memory management code (for example programming IOMMUs), and an inability of mainstream OSes to securely manage the complete set of hardware resources on, say, a phone System-on-Chip. Our new work is based on a recent formal model of address translation hardware which views the machine as a configurable network of address spaces. We refine this to capture existing address translation hardware from modern SoCs and accelerators at a sufficiently fine granularity to model minimal rights both to access memory and configure translation hardware. We then build an executable specification in Haskell, which expresses the model and metadata structures in terms of partitioned capabilities. Finally, we show a fully functional implementation of the model in C created by extending the capability system of the Barrelfish research OS. Our evaluation shows that our unoptimized implementation has comparable (and in some cases) better performance than the Linux virtual memory system, despite both capturing all the functionality of modern hardware addressing and enabling least-privilege, decentralized authority to access physical memory and devices. △ Less

Submitted 23 August, 2019; originally announced August 2019.

arXiv:1812.02639 [pdf, other]

Shared Arrangements: practical inter-query sharing for streaming dataflows

Authors: Frank McSherry, Andrea Lattuada, Malte Schwarzkopf, Timothy Roscoe

Abstract: Current systems for data-parallel, incremental processing and view maintenance over high-rate streams isolate the execution of independent queries. This creates unwanted redundancy and overhead in the presence of concurrent incrementally maintained queries: each query must independently maintain the same indexed state over the same input streams, and new queries must build this state from scratch… ▽ More Current systems for data-parallel, incremental processing and view maintenance over high-rate streams isolate the execution of independent queries. This creates unwanted redundancy and overhead in the presence of concurrent incrementally maintained queries: each query must independently maintain the same indexed state over the same input streams, and new queries must build this state from scratch before they can begin to emit their first results. This paper introduces shared arrangements: indexed views of maintained state that allow concurrent queries to reuse the same in-memory state without compromising data-parallel performance and scaling. We implement shared arrangements in a modern stream processor and show order-of-magnitude improvements in query response time and resource consumption for interactive queries against high-throughput streams, while also significantly improving performance in other domains including business analytics, graph processing, and program analysis. △ Less

Submitted 11 June, 2020; v1 submitted 6 December, 2018; originally announced December 2018.

arXiv:1812.01371 [pdf, other]

Megaphone: Latency-conscious state migration for distributed streaming dataflows

Authors: Moritz Hoffmann, Andrea Lattuada, Frank McSherry, Vasiliki Kalavri, John Liagouris, Timothy Roscoe

Abstract: We design and implement Megaphone, a data migration mechanism for stateful distributed dataflow engines with latency objectives. When compared to existing migration mechanisms, Megaphone has the following differentiating characteristics: (i) migrations can be subdivided to a configurable granularity to avoid latency spikes, and (ii) migrations can be prepared ahead of time to avoid runtime coordin… ▽ More We design and implement Megaphone, a data migration mechanism for stateful distributed dataflow engines with latency objectives. When compared to existing migration mechanisms, Megaphone has the following differentiating characteristics: (i) migrations can be subdivided to a configurable granularity to avoid latency spikes, and (ii) migrations can be prepared ahead of time to avoid runtime coordination. Megaphone is implemented as a library on an unmodified timely dataflow implementation, and provides an operator interface compatible with its existing APIs. We evaluate Megaphone on established benchmarks with varying amounts of state and observe that compared to naïve approaches Megaphone reduces service latencies during reconfiguration by orders of magnitude without significantly increasing steady-state overhead. △ Less

Submitted 16 April, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

arXiv:1808.06893 [pdf, other]

DeltaPath: dataflow-based high-performance incremental routing

Authors: Desislava Dimitrova, John Liagouris, Sebastian Wicki, Moritz Hoffmann, Vasiliki Kalavri, Timothy Roscoe

Abstract: Routing controllers must react quickly to failures, reconfigurations and workload or policy changes, to ensure service performance and cost-efficient network operation. We propose a general execution model which views routing as an incremental data-parallel computation on a graph-based network model plus a continuous stream of network changes. Our approach supports different routing objectives wit… ▽ More Routing controllers must react quickly to failures, reconfigurations and workload or policy changes, to ensure service performance and cost-efficient network operation. We propose a general execution model which views routing as an incremental data-parallel computation on a graph-based network model plus a continuous stream of network changes. Our approach supports different routing objectives with only minor re-configuration of its core algorithm, and easily accomodates dynamic user-defined routing policies. Moreover, our prototype demonstrates excellent performance: on Google Jupiter topology it reacts with a median time of 350ms to link failures and serves more than two million path requests per second each with latency under 1ms. This is three orders-of-magnitude faster than the popular ONOS open-source SDN controller. △ Less

Submitted 21 August, 2018; originally announced August 2018.

arXiv:1703.06571 [pdf, other]

doi 10.4204/EPTCS.244.4

Formalizing Memory Accesses and Interrupts

Authors: Reto Achermann, Lukas Humbel, David Cock, Timothy Roscoe

Abstract: The hardware/software boundary in modern heterogeneous multicore computers is increasingly complex, and diverse across different platforms. A single memory access by a core or DMA engine traverses multiple hardware translation and caching steps, and the destination memory cell or register often appears at different physical addresses for different cores. Interrupts pass through a complex topology… ▽ More The hardware/software boundary in modern heterogeneous multicore computers is increasingly complex, and diverse across different platforms. A single memory access by a core or DMA engine traverses multiple hardware translation and caching steps, and the destination memory cell or register often appears at different physical addresses for different cores. Interrupts pass through a complex topology of interrupt controllers and remappers before delivery to one or more cores, each with specific constraints on their configurations. System software must not only correctly understand the specific hardware at hand, but also configure it appropriately at runtime. We propose a formal model of address spaces and resources in a system that allows us to express and verify invariants of the system's runtime configuration, and illustrate (and motivate) it with several real platforms we have encountered in the process of OS implementation. △ Less

Submitted 19 March, 2017; originally announced March 2017.

Comments: In Proceedings MARS 2017, arXiv:1703.05812

ACM Class: D.4; D.4.2; D.4.7; D.3.4

Journal ref: EPTCS 244, 2017, pp. 66-116

Showing 1–10 of 10 results for author: Roscoe, T