We are delighted to announce the highlight papers that will be part of the technical program of SYSTOR 2022.
Unikraft: Fast, Specialized Unikernels the Easy Way (EuroSys 2021)
Simon Kuenzer (NEC Laboratories Europe GmbH), Vlad-Andrei Bădoiu (University Politehnica of Bucharest), Hugo Lefeuvre (The University of Manchester), Sharan Santhanam (NEC Laboratories Europe GmbH), Alexander Jung (Lancaster University), Gaulthier Gain (University of Liège), Cyril Soldani (University of Liège), Costin Lupu (University Politehnica of Bucharest), Stefan Teodorescu (University Politehnica of Bucharest), Costi Răducanu (University Politehnica of Bucharest), Cristian Banu (University Politehnica of Bucharest), Laurent Mathy (University of Liège), Răzvan Deaconescu (University Politehnica of Bucharest), Costin Raiciu (University Politehnica of Bucharest), Felipe Huici (NEC Laboratories Europe GmbH)
Unikernels are famous for providing excellent performance in terms of boot times, throughput, and memory consumption, to name a few metrics. However, they are infamous for making it hard and extremely time-consuming to extract such performance, and for requiring significant engineering effort to port applications to them. We introduce Unikraft, a novel micro-library OS that (1) fully modularizes OS primitives so that it is easy to customize the unikernel and include only relevant components and (2) exposes a set of composable, performance-oriented APIs in order to make it easy for developers to obtain high performance.
Our evaluation using off-the-shelf applications such as nginx, SQLite, and Redis shows that running them on Unikraft results in a 1.7x-2.7x performance improvement compared to Linux guests. In addition, Unikraft images for these apps are around 1MB, require less than 10MB of RAM to run, and boot in around 1ms on top of the VMM time (total boot time 3ms-40ms). Unikraft is a Linux Foundation open source project and can be found at www.unikraft.org.
PaSh: Light-touch Data-Parallel Shell Processing (EuroSys 2021)
Nikos Vasilakis (MIT), Konstantinos Kallas (University of Pennsylvania), Konstantinos Mamouras (Rice University), Achilles Benetopoulos (UC Santa Cruz), Lazar Cvetković (ETH Zurich)
This paper presents PaSh, a system for parallelizing POSIX shell scripts. Given a script, PaSh converts it to a dataflow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the dataflow graph back into a script—one that adds POSIX constructs to explicitly guide parallelism, coupled with PaSh-provided Unix-aware runtime primitives for addressing performance- and correctness-related issues. A lightweight annotation language allows command developers to express key parallelizability properties about their commands. An accompanying parallelizability study of POSIX and GNU commands—two large and commonly used groups—guides the annotation language and optimized aggregator library that PaSh uses. PaSh’s extensive evaluation over 44 unmodified Unix scripts shows significant speedups (0.89–61.1×, avg: 6.7×) stemming from the combination of its program transformations and runtime primitives.
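To make the core transformation concrete, here is a minimal, hypothetical Python sketch of the underlying idea (this is an illustration, not PaSh's implementation, which operates on shell dataflow graphs and real Unix commands): a stateless filter such as grep can be applied to contiguous input chunks independently, with order-preserving concatenation serving as the aggregator.

```python
# Hypothetical sketch of data-parallelizing a stateless Unix-style filter.
# grep_like stands in for a command like `grep` whose output on a chunk
# depends only on that chunk, so chunks can be processed in parallel and
# the results concatenated in input order (the "aggregator").
from concurrent.futures import ThreadPoolExecutor

def grep_like(lines, pattern):
    """Stand-in for a parallelizable filter: keep lines containing pattern."""
    return [line for line in lines if pattern in line]

def parallel_apply(lines, pattern, workers=4):
    # Split the input into contiguous chunks, one slice per worker.
    step = max(1, len(lines) // workers)
    chunks = [lines[i:i + step] for i in range(0, len(lines), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so concatenation is a valid aggregator.
        parts = pool.map(grep_like, chunks, [pattern] * len(chunks))
    out = []
    for part in parts:
        out.extend(part)
    return out
```

Real parallelizability is subtler than this sketch suggests: commands like `sort` or `uniq` need non-trivial aggregators, which is exactly what PaSh's annotation language and aggregator library address.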
Updates since publication: PaSh has joined the Linux Foundation, has its core model and transformations proved correct, has incorporated POSIX-compliant parsing, and has transitioned to a just-in-time compilation architecture that offers additional speedups across a variety of scripts. Our presentation will touch upon many of these new features!
FlexDriver: A Network Driver for Your Accelerator (ASPLOS 2022)
Haggai Eran (NVIDIA & Technion), Maxim Fudim (NVIDIA), Gabi Malka (Technion), Gal Shalom (NVIDIA & Technion), Noam Cohen (NVIDIA), Amit Hermony (NVIDIA), Dotan Levi (NVIDIA), Liran Liss (NVIDIA), Mark Silberstein (Technion)
We propose a new system design for connecting hardware and FPGA accelerators to the network, allowing the accelerator to directly control commodity Network Interface Cards (NICs) without using the CPU. This enables us to solve the key challenge of leveraging existing NIC hardware offloads such as virtualization, tunneling, and RDMA for accelerator networking. Our approach supports a diverse set of use cases, from direct network access for disaggregated accelerators to inline-acceleration of the network stack, all without the complex networking logic in the accelerator.
To demonstrate the feasibility of this approach, we build FlexDriver (FLD), an on-accelerator hardware module that implements a NIC data-plane driver. Our main technical contribution is a mechanism that compresses the NIC control structures by two orders of magnitude, allowing FLD to achieve high networking scalability with low die area cost and no bandwidth interference with the accelerator logic.
The prototype for NVIDIA Innova-2 FPGA SmartNICs showcases our design’s utility for three different accelerators: a disaggregated LTE cipher, an IP-defragmentation inline accelerator, and an IoT cryptographic-token authentication offload. These accelerators reach 25 Gbps line rate and leverage the NIC for RDMA processing, VXLAN tunneling, and traffic shaping without CPU involvement.
FragPicker: A New Defragmentation Tool for Modern Storage Devices (SOSP 2021)
Jonggyu Park (Sungkyunkwan University), Young Ik Eom (Dept. of Electrical and Computer Engineering / College of Computing and Informatics, Sungkyunkwan University)
File fragmentation has been widely studied for several decades because it negatively influences various I/O activities. To eliminate fragmentation, most defragmentation tools migrate the entire content of files into a new area. Unfortunately, such methods inevitably generate a large amount of I/O in the process of data migration. For this reason, the conventional tools (i) cause defragmentation to be time-consuming, (ii) significantly degrade the performance of co-running applications, and (iii) even curtail the lifetime of modern storage devices. Consequently, although defragmentation is necessary, it is rarely used in practice.
Our extensive experiments discover that, unlike HDDs, the performance degradation of modern storage devices incurred by fragmentation mainly stems from request splitting, where a single I/O request is split into multiple ones. With this insight, we propose a new defragmentation tool, FragPicker, to minimize the amount of I/Os induced by defragmentation, while significantly improving I/O performance. FragPicker analyzes the I/O activities of applications and migrates only those pieces of data that are crucial to the I/O performance, in order to mitigate the aforementioned problems of existing tools. Experimental results demonstrate that FragPicker efficiently reduces the amount of I/Os for defragmentation while achieving a similar level of performance improvement to the conventional defragmentation schemes.
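The request-splitting effect at the heart of this insight can be sketched as follows. This is a hypothetical illustration, not code from FragPicker; the extent representation is assumed for the example.

```python
# Hypothetical illustration of "request splitting": a sequential read that
# spans N physically discontiguous extents is split into N separate device
# requests, whereas the same read over a contiguous file needs only one.
# On SSDs, the paper observes, this splitting (rather than seek distance)
# is the dominant cost of fragmentation.
def count_device_requests(extents):
    """extents: list of (start_block, length) pairs in logical file order.
    A new device request starts whenever the next extent is not physically
    adjacent to the end of the previous one."""
    requests = 0
    prev_end = None
    for start, length in extents:
        if prev_end is None or start != prev_end:
            requests += 1
        prev_end = start + length
    return requests
```

Under this model, a defragmenter only needs to coalesce the extents that hot I/O paths actually touch, which is why migrating a small, performance-critical subset of data can recover most of the lost performance.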
PACEMAKER: Avoiding HeART attacks in storage clusters with disk-adaptive redundancy (OSDI 2020)
Saurabh Kadekodi (CMU), Francisco Maturana (CMU), Suhas Jayaram Subramanya (CMU), Juncheng Yang (CMU), K. V. Rashmi (CMU), Gregory R. Ganger (CMU)
Data redundancy provides resilience in large-scale storage clusters, but imposes significant cost overhead. Substantial space-savings can be realized by tuning redundancy schemes to observed disk failure rates. However, prior design proposals for such tuning are unusable in real-world clusters, because the IO load of transitions between schemes overwhelms the storage infrastructure (termed transition overload).
This paper analyzes traces for millions of disks from production systems at Google, NetApp, and Backblaze to expose and understand transition overload as a roadblock to disk-adaptive redundancy: transition IO under existing approaches can consume 100% of cluster IO continuously for several weeks. Building on the insights drawn, we present PACEMAKER, a low-overhead disk-adaptive redundancy orchestrator. PACEMAKER mitigates transition overload by (1) proactively organizing data layouts to make future transitions efficient, and (2) initiating transitions proactively in a manner that avoids urgency while not compromising on space-savings. Evaluation of PACEMAKER with traces from four large (110K–450K disks) production clusters shows that the transition IO requirement decreases to never needing more than 5% of cluster IO bandwidth (0.2–0.4% on average). PACEMAKER achieves this while providing overall space-savings of 14–20% and never leaving data under-protected. We also describe and experiment with an integration of PACEMAKER into HDFS.
The what, The from, and The to: The Migration Games in Deduplicated Systems (FAST 2022)
Roei Kisous (Technion – Israel Institute of Technology), Ariel Kolikant (Technion – Israel Institute of Technology), Abhinav Duggal (DELL EMC), Sarai Sheinvald (ORT Braude College of Engineering), Gala Yadgar (Technion – Israel Institute of Technology)
Deduplication reduces the size of the data stored in large-scale storage systems by replacing duplicate data blocks with references to their unique copies. This creates dependencies between files that contain similar content, and complicates the management of data in the system. In this paper, we address the problem of data migration, where files are remapped between different volumes as a result of system expansion or maintenance. The challenge of determining which files and blocks to migrate has been studied extensively for systems without deduplication. In the context of deduplicated storage, however, only simplified migration scenarios were considered.
In this paper, we formulate the general migration problem for deduplicated systems as an optimization problem whose objective is to minimize the system’s size while ensuring that the storage load is evenly distributed between the system’s volumes, and that the network traffic required for the migration does not exceed its allocation.
We then present three algorithms for generating effective migration plans, each based on a different approach and representing a different tradeoff between computation time and migration efficiency. Our greedy algorithm provides modest space savings, but is appealing thanks to its exceptionally short runtime. Its results can be improved by using larger system representations. Our theoretically optimal algorithm formulates the migration problem as an ILP (integer linear programming) instance. Its migration plans consistently result in smaller and more balanced systems than those of the greedy approach, although its runtime is long and, as a result, the theoretical optimum is not always found. Our clustering algorithm enjoys the best of both worlds: its migration plans are comparable to those generated by the ILP-based algorithm, but its runtime is shorter, sometimes by an order of magnitude. It can be further accelerated at a modest cost in the quality of its results.
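As a rough illustration of the greedy flavor of this problem, consider the following hypothetical Python sketch (not the paper's algorithm; the volume representation is assumed, and the load-balance and traffic constraints are omitted): because a volume stores only one copy of each block, remapping a file shrinks the system when its blocks already exist on the target and become unreferenced at the source.

```python
# Hypothetical greedy sketch of deduplication-aware migration.
# volumes: {volume_name: {file_name: set_of_block_ids}}.
# Each volume stores the union of its files' blocks once, so the system
# size is the sum of unique blocks per volume. (Real migration must also
# balance load and cap migration traffic, which this sketch ignores.)
def system_size(volumes):
    return sum(len(set().union(*files.values())) if files else 0
               for files in volumes.values())

def greedy_migrate(volumes, steps=10):
    for _ in range(steps):
        best = None
        base = system_size(volumes)
        for src in volumes:
            for f in list(volumes[src]):
                for dst in volumes:
                    if dst == src:
                        continue
                    # Tentatively remap file f from src to dst and
                    # measure the change in total system size.
                    blocks = volumes[src].pop(f)
                    volumes[dst][f] = blocks
                    gain = base - system_size(volumes)
                    volumes[dst].pop(f)
                    volumes[src][f] = blocks
                    if gain > 0 and (best is None or gain > best[0]):
                        best = (gain, src, dst, f)
        if best is None:       # no remapping reduces the system size
            return volumes
        _, src, dst, f = best  # apply the best single-file remapping
        volumes[dst][f] = volumes[src].pop(f)
    return volumes
```

Even this toy version hints at why the problem is hard: each move changes which blocks are shared, so single-file greedy choices can miss savings that jointly remapping a group of similar files would achieve, which is what the ILP and clustering algorithms target.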