We are delighted to announce the highlight papers that will be part of the technical program of SYSTOR 2023.
Starlight: Fast Container Provisioning on the Edge and over the WAN (NSDI 2022)
Jun Lin Chen and Daniyal Liaqat, University of Toronto; Moshe Gabel, York University / University of Toronto; Eyal de Lara, University of Toronto
Containers, originally designed for cloud environments, are increasingly popular for provisioning workers outside a single datacenter, for example in mobile and edge computing. These settings, however, bring new challenges: high latency links, limited bandwidth, and resource-constrained workers. The result is longer provisioning times when deploying new workers or updating existing ones, much of it due to network traffic.
Our analysis shows that current piecemeal approaches to reducing provisioning time are not always sufficient, and can even make things worse as round-trip times grow. Rather, we find that the very same layer-based structure that makes containers easy to develop and use also makes it more difficult to optimize deployment. Addressing this issue thus requires rethinking the container deployment pipeline as a whole.
Based on our findings, I will present Starlight: an accelerator for container provisioning. Starlight decouples provisioning from development by redesigning the container deployment protocol, filesystem, and image storage format. Our evaluation using 21 popular containers shows that, on average, Starlight deploys and starts containers 3x faster than the current industry standard implementation while incurring no runtime overhead and negligible storage overhead. Finally, it requires no changes to the deployed application, is backwards compatible with existing workers, and uses standard container registries.
Starlight is open source and available at https://github.com/mc256/starlight.
**Updates since publication in NSDI 2022: We have continued to improve and develop Starlight, adding features such as Kubernetes integration, joint optimization of multiple container deployments, an improved backend, and support for Helm charts and authentication. Starlight is now being used in industry, including as a crucial enabler in an unusual edge computing product. The talk will touch on these topics.**
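The redesigned deployment pipeline described above can be pictured as a delta computation: the provisioner sends only the files the worker does not already hold, ordered by when the container is expected to need them at startup, so the container can begin running before the full image has arrived. The sketch below is a minimal, hypothetical Python illustration of that idea; `plan_delta` and its inputs are illustrative names, not Starlight's actual interfaces or wire format.

```python
# Conceptual sketch (not Starlight's actual API): provisioning as a delta
# between the files a worker already holds and the files the new image needs,
# streamed in the order the container is expected to access them at startup.

def plan_delta(worker_files, image_manifest, startup_order):
    """Return the files to send, earliest-needed first.

    worker_files:   {path: content_hash} already present on the worker
    image_manifest: {path: content_hash} for the requested image
    startup_order:  paths ranked by when the container reads them at start
    """
    missing = {
        path: digest
        for path, digest in image_manifest.items()
        if worker_files.get(path) != digest  # skip files the worker already has
    }
    rank = {path: i for i, path in enumerate(startup_order)}
    # Files needed early during startup go first; the container can begin
    # running before the rest of the image has arrived.
    return sorted(missing, key=lambda p: rank.get(p, len(rank)))


if __name__ == "__main__":
    worker = {"/bin/sh": "a1", "/etc/ssl/certs": "c3"}
    image = {"/bin/sh": "a1", "/app/server": "b7", "/app/assets/logo.png": "d9"}
    order = ["/bin/sh", "/app/server", "/app/assets/logo.png"]
    print(plan_delta(worker, image, order))  # ['/app/server', '/app/assets/logo.png']
```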
SwiSh: Distributed Shared State Abstractions for Programmable Switches (NSDI 2022)
Lior Zeno, Technion; Dan R. K. Ports, Jacob Nelson, and Daehyeok Kim, Microsoft Research; Shir Landau Feibish, The Open University of Israel; Idit Keidar, Arik Rinberg, Alon Rashelbach, Igor De-Paula, and Mark Silberstein, Technion
We design and evaluate SwiSh, a distributed shared state management layer for data-plane P4 programs. SwiSh enables running scalable stateful distributed network functions on programmable switches entirely in the data plane. We explore several schemes for building a shared variable abstraction, which differ in consistency, performance, and in-switch implementation complexity. We introduce the novel Strong Delayed-Writes (SDW) protocol, which offers consistent snapshots of shared data-plane objects with semantics known as r-relaxed strong linearizability, enabling the implementation of distributed concurrent sketches with precise error bounds.
We implement strong, eventual, and SDW consistency protocols in Tofino switches and compare their performance in microbenchmarks and in three realistic network functions: a NAT, a DDoS detector, and a rate limiter. Our results show that distributed state management in the data plane is practical and outperforms centralized solutions by up to four orders of magnitude in update throughput and replication latency.
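The relaxed-consistency idea behind delayed writes can be sketched at a very high level in plain Python. The real SDW protocol runs in P4 on Tofino switches and is considerably more involved; the class and method names below are illustrative, not SwiSh's API. The point of the sketch is only this: writes are applied locally on the fast path, and replicas exchange batched deltas in periodic merge rounds, so a replica's view lags the global state by a bounded amount instead of requiring cross-switch coordination on every packet.

```python
# Illustrative sketch of a delayed-writes shared counter (not the actual SwiSh
# SDW protocol): each replica applies writes locally and periodically folds
# everyone's batched deltas into a merged snapshot.

class DelayedWriteCounter:
    def __init__(self, replica_id, peers):
        self.replica_id = replica_id
        self.peers = peers              # other DelayedWriteCounter instances
        self.local = {}                 # writes applied here since the last merge
        self.merged = {}                # state agreed on at the last merge round

    def increment(self, key, amount=1):
        # Fast path: purely local, no cross-replica traffic per update.
        self.local[key] = self.local.get(key, 0) + amount

    def read(self, key):
        # Reads see the last merged snapshot plus this replica's pending writes.
        return self.merged.get(key, 0) + self.local.get(key, 0)

    def merge_round(self):
        # Slow path, run periodically: fold every replica's pending writes
        # into every replica's merged snapshot, then clear the pending buffers.
        replicas = [self] + self.peers
        for rep in replicas:
            for key, delta in rep.local.items():
                for target in replicas:
                    target.merged[key] = target.merged.get(key, 0) + delta
        for rep in replicas:
            rep.local.clear()


a = DelayedWriteCounter("sw1", peers=[])
b = DelayedWriteCounter("sw2", peers=[])
a.peers, b.peers = [b], [a]
a.increment("flow#42", 3)
b.increment("flow#42", 2)
a.merge_round()
assert a.read("flow#42") == b.read("flow#42") == 5
```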
Privbox: Faster System Calls Through Sandboxed Privileged Execution (USENIX ATC ’22)
Dima Kuznetsov and Adam Morrison, Tel Aviv University
System calls are the main method for applications to request services from the operating system, but their invocation incurs considerable overhead, which has been aggravated by mitigation mechanisms for transient execution attacks. Proposed approaches for reducing system call overhead all break the semantic equivalence between system calls and regular function calls (e.g., by making system calls asynchronous), and so their adoption requires rearchitecting applications.
This paper proposes Privbox, a new approach for lightweight system calls that maintains the familiar synchronous, function-like system call model. Privbox allows an application to execute system call-intensive code in a semi-privileged, sandboxed execution mode, called a ‘privbox’. Semi-privileged execution is architecturally similar to the kernel’s privileged execution, which enables faster invocation of system calls, but the code is sandboxed to ensure that it cannot use its elevated privileges to compromise the system. We further propose semi-privileged access prevention (SPAP), a simple hardware architectural feature that alleviates much of Privbox’s instrumentation overhead.
We implement Privbox based on Linux and LLVM. Our evaluation on x86 (Intel Skylake) hardware shows that Privbox (1) speeds up system call invocation by 2.2 times; (2) can increase throughput of I/O-threaded applications by up to 1.7 times; and (3) can increase the throughput of real-world workloads such as Redis by up to 7.6% and 11%, without and with SPAP, respectively.
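To make the cost that Privbox targets concrete, the hypothetical microbenchmark below times a cheap system call from user space and compares it with a call that never enters the kernel. This is only an illustration of the overhead the abstract refers to (and it also includes Python interpreter overhead); it is not part of the Privbox artifact, and absolute numbers vary with the CPU, kernel version, and which transient-execution mitigations are enabled.

```python
# Rough illustration of per-system-call cost (not part of the Privbox artifact):
# time a cheap syscall in a tight loop and compare it with a no-op call that
# never crosses into the kernel.
import os
import time

N = 1_000_000

def time_it(fn):
    start = time.perf_counter()
    for _ in range(N):
        fn()
    return (time.perf_counter() - start) / N * 1e9  # nanoseconds per call

fd = os.open("/dev/null", os.O_RDONLY)
syscall_ns = time_it(lambda: os.lseek(fd, 0, os.SEEK_SET))  # one syscall per iteration
noop_ns = time_it(lambda: None)                             # no kernel crossing
os.close(fd)

print(f"lseek on /dev/null: ~{syscall_ns:.0f} ns/call")
print(f"no-op Python call:  ~{noop_ns:.0f} ns/call")
```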
Scaling Open vSwitch with a Computational Cache (NSDI ’22)
Alon Rashelbach, Ori Rottenstreich, and Mark Silberstein, Technion
Open vSwitch (OVS) is a widely used open-source virtual switch implementation. In this work, we seek to scale up OVS to support hundreds of thousands of OpenFlow rules by accelerating the core component of its data-path: the packet classification mechanism. To do so, we use NuevoMatch, a recent algorithm that uses neural network inference to match packets and promises significant scalability and performance benefits. We overcome the primary algorithmic challenge of the slow rule-update rate in vanilla NuevoMatch, speeding it up by over three orders of magnitude. This improvement enables two design options for integrating NuevoMatch with OVS: (1) using it as an extra caching layer in front of OVS’s megaflow cache, and (2) using it to completely replace OVS’s data-path, performing classification directly on OpenFlow rules and obviating control-path upcalls. Our comprehensive evaluation on real-world packet traces and ClassBench rules demonstrates geometric mean speedups of 1.9x and 12.3x for the first and second designs, respectively, with 500K rules, with the latter also supporting up to 60K OpenFlow rule updates per second, far exceeding the original OVS.
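For intuition, design (1) above can be pictured as a tiered lookup: a computational cache is consulted first, a miss falls back to the exact megaflow cache, and only then does the packet trigger a control-path upcall. The Python sketch below is a schematic of that flow with duck-typed stand-ins for the learned model and caches; it is not the actual OVS or NuevoMatch code, and the explicit validation of the predicted rule is an assumption about how such a cache would be kept correct.

```python
# Conceptual sketch of design (1): a "computational cache" in front of OVS's
# megaflow cache, with a control-path upcall as the final fallback. The learned
# index that NuevoMatch builds over the rule set is stubbed out here.

class ClassifierPipeline:
    def __init__(self, computational_cache, megaflow_cache, upcall):
        self.computational_cache = computational_cache  # NuevoMatch-style model (stub)
        self.megaflow_cache = megaflow_cache            # exact-match flow cache (a dict here)
        self.upcall = upcall                            # slow path into the control plane

    def classify(self, packet_key):
        # Tier 1: the model proposes a candidate rule; it is validated before use,
        # so a wrong prediction costs time but never correctness.
        candidate = self.computational_cache.lookup(packet_key)
        if candidate is not None and candidate.matches(packet_key):
            return candidate.action
        # Tier 2: OVS's exact megaflow cache.
        action = self.megaflow_cache.get(packet_key)
        if action is not None:
            return action
        # Tier 3: upcall to the control path, which also installs a megaflow entry.
        action = self.upcall(packet_key)
        self.megaflow_cache[packet_key] = action
        return action
```

Design (2) would instead drop tiers two and three, classifying directly against the OpenFlow rules in the data path and avoiding control-path upcalls altogether.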