We are delighted to announce the papers that have been accepted to ACM SYSTOR’22.
Overflowing Emerging Neural Network Inference Tasks from the GPU to the CPU on Heterogeneous Servers
Adithya Kumar, Anand Sivasubramaniam, Timothy Zhu (The Pennsylvania State University)
While current deep learning (DL) inference runtime systems sequentially offload a model's tasks onto an available GPU/accelerator based on its capability, we make a case for selectively redirecting some of these tasks to the CPU and running them concurrently while the GPU does other work. This new opportunity arises specifically for emerging DL models whose data flow graphs (DFGs) have much wider fan-outs than traditional ones, which are invariably linear chains of tasks. By opportunistically moving some of these tasks to the CPU, we can (i) shave service times off the critical path of the DFG, (ii) devote the GPU to more deserving tasks, and (iii) improve overall utilization of the provisioned hardware in the server. However, several factors, such as a task's criticality in the DFG, its slowdown when moved to a different hardware engine, and the overhead of transferring input/output data across these engines, determine what tasks should be redirected, when, and how. While solving this optimally is computationally demanding and slow, through a series of rationales we derive a fast technique for task overflow from GPU to CPU. We implement this technique in a heterogeneous concurrent runtime engine built on top of the state-of-the-art ONNXRuntime engine and demonstrate > 10% reduction in latency, > 19% gain in throughput, and > 9.8% savings in GPU memory usage for emerging neural network models.
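To make the task-overflow idea concrete, the sketch below shows a simple greedy decision rule of the kind the abstract hints at: a task is sent to the CPU only if it is off the critical path and its CPU service time plus data-transfer cost beats waiting behind work already queued on the GPU. The task profiles, the queueing term, and the rule itself are illustrative assumptions, not the paper's actual policy.

```c
#include <stdio.h>

/* Hypothetical per-task profile for one node in the model's data-flow graph. */
struct task {
    const char *name;
    double gpu_ms;          /* service time on the GPU */
    double cpu_ms;          /* service time on the CPU (slowdown included) */
    double xfer_ms;         /* cost of moving inputs/outputs across engines */
    int    on_critical_path;
};

/* Illustrative overflow rule (not the paper's exact policy): move a task to
 * the CPU only when it is off the critical path and its CPU time plus data
 * transfer still beats waiting behind the work queued on the GPU. */
static int overflow_to_cpu(const struct task *t, double gpu_queue_ms) {
    if (t->on_critical_path)
        return 0;
    return t->cpu_ms + t->xfer_ms < t->gpu_ms + gpu_queue_ms;
}

int main(void) {
    struct task branches[] = {
        {"embedding-a", 2.0,  3.5, 0.4, 0},
        {"attention",   5.0, 40.0, 1.0, 1},
        {"embedding-b", 1.5,  2.8, 0.3, 0},
    };
    double gpu_queue_ms = 4.0;  /* work already queued on the GPU */

    for (int i = 0; i < 3; i++)
        printf("%s -> %s\n", branches[i].name,
               overflow_to_cpu(&branches[i], gpu_queue_ms) ? "CPU" : "GPU");
    return 0;
}
```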
I/O Interface Independence with xNVMe
Simon A. F. Lund (Samsung), Philippe Bonnet (IT University of Copenhagen), Klaus Jensen, Javier Gonzalez (Samsung)
The tight coupling of data-intensive systems and I/O interfaces has been a problem for years. A database system that relies on a specific I/O backend for direct asynchronous I/O, such as libaio, inherits its limitations in terms of portability, expressiveness, and performance. The emergence of high-performance NVMe Solid-State Drives (SSDs), enabling new command sets, compounds this problem. Indeed, efforts to streamline the I/O stack have led to the introduction of new, complex, and idiosyncratic I/O interfaces such as SPDK, io_uring, or asynchronous ioctls. What is the appropriate I/O interface for a given system? How can applications effectively leverage SSD and end-to-end I/O interface innovations? Is I/O interface lock-in a necessary evil for data-intensive systems and storage services? Our answer to the latter question is no. Our answer to the former questions is xNVMe, a cross-platform user-space library that provides I/O-interface independence to user-space software. In this paper, we present the xNVMe API, detail its design, and show that xNVMe has negligible cost atop the most efficient I/O interfaces on Linux, FreeBSD, and Windows.
Dedup-for-Speed: Storing Duplications in Fast Flash Mode for Enhanced Read Performance
Jaeyong Bae, Jaehyung Park, Yuhun Jun, Euiseong Seo (Sungkyunkwan University)
Storage deduplication improves write latency, yields additional space, and reduces the wear of storage media by eliminating redundant writes. The flash translation layer (FTL) of a flash solid-state drive (SSD) easily enables deduplication in an SSD by simply mapping duplicated logical pages to the same physical page. Therefore, a few deduplicating FTLs have been proposed thus far. However, deduplication of partially duplicated files breaks the sequentiality of data storage at the flash page level and results in significant degradation of read performance. Although storage space savings, reduced flash writes, and an extended lifespan are barely perceptible to users, the extended read latency is critical to user-perceived performance. In this paper, we propose a novel deduplication FTL, Dedup-for-Speed (DFS). The DFS FTL trades surplus capacity gained through inline deduplication for improved read performance by storing duplicated pages in fast flash modes, such as pseudo-SLC (single-level cell). The flash mode of a page is determined by its degree of deduplication. Migrating duplicate pages to fast flash blocks is performed during idle intervals to minimize interference with host-issued operations. Contrary to conventional deduplication schemes, DFS improves read performance while maintaining the aforementioned benefits of deduplication. Our evaluation with six real-world traces showed that DFS improved read latency by 16% on average and by up to 34%. It also improved write latency by 64% on average and by up to 82%.
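The core write-path idea, mapping duplicated logical pages to a single physical page and promoting highly shared pages to a fast flash mode, can be sketched as a toy in-memory model. The fingerprint values, the sharing threshold, and the data structures below are hypothetical simplifications, not DFS's actual design.

```c
#include <stdio.h>

#define NR_LPNS    8
#define SLC_THRESH 2   /* hypothetical: pages shared this often go to pseudo-SLC */

/* Toy physical page: identified by its content fingerprint, with a
 * reference count and the flash mode it is stored in. */
struct ppage {
    unsigned fp;
    int refcnt;
    int pseudo_slc;
};

static struct ppage ppages[NR_LPNS];
static int l2p[NR_LPNS];            /* logical -> physical page mapping */
static int nr_ppages;

/* Write path of an inline-deduplicating FTL, reduced to its essence:
 * if the fingerprint already exists, point the logical page at the same
 * physical page; highly shared pages are marked for pseudo-SLC placement. */
static void write_page(int lpn, unsigned fp) {
    for (int i = 0; i < nr_ppages; i++) {
        if (ppages[i].fp == fp) {
            ppages[i].refcnt++;
            if (ppages[i].refcnt >= SLC_THRESH)
                ppages[i].pseudo_slc = 1;   /* migrated during idle time */
            l2p[lpn] = i;
            return;                         /* duplicate: no new flash write */
        }
    }
    ppages[nr_ppages] = (struct ppage){ .fp = fp, .refcnt = 1 };
    l2p[lpn] = nr_ppages++;
}

int main(void) {
    write_page(0, 0xAAAA);
    write_page(1, 0xAAAA);   /* duplicate of lpn 0 */
    write_page(2, 0xBBBB);
    for (int lpn = 0; lpn < 3; lpn++)
        printf("lpn %d -> ppn %d (%s)\n", lpn, l2p[lpn],
               ppages[l2p[lpn]].pseudo_slc ? "pseudo-SLC" : "MLC/TLC");
    return 0;
}
```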
FaaS in the Age of (sub-)μs I/O: A Performance Analysis of Snapshotting
Christos Katsakioris, Chloe Alverti (National Technical University of Athens), Vasileios Karakostas (University of Athens), Konstantinos Nikas, Georgios Goumas, Nectarios Koziris (National Technical University of Athens)
Although serverless computing brings major benefits to developers, the widespread adoption of Function-as-a-Service (FaaS) creates severe challenges for cloud providers. Irregularity in function invocation patterns and the high cost of cold starts have led them to allocate precious DRAM resources to keep function instances always warm, a clearly sub-optimal and inflexible approach. To cope with this issue, both state-of-the-art and state-of-practice approaches consider snapshotting as a viable mitigation, thus directly associating cold start latency with storage performance.
Prior studies consider storage to be inert, rather than the evolving hierarchy that it truly is. In this work, we evaluate cold start and warm function invocations on instances restored from snapshots residing on devices across different layers of the modern storage hierarchy. We thoroughly analyze and characterize the observed behavior of multiple workloads and identify fundamental trade-offs among the devices. We conclude by motivating and providing suggestions for the inclusion of the modern storage hierarchy as a decisive factor in serverless resource provisioning.
Eliminate the Overhead of Interrupt Checking in Full-System Dynamic Binary Translator
Gen Niu, Fuxin Zhang, Xinyu Li (Institute of Computing Technology, Chinese Academy of Sciences)
Dynamic binary translation is a common technology for program emulation, instrumentation, and debugging. A full-system dynamic binary translator usually contains software implementations of hardware devices and is able to emulate a complete operating system. To support this, handling interrupts is essential. Many full-system dynamic binary translators currently use a simple but inefficient scheme to check for pending interrupts: a piece of host binary code is inserted into each translated block and executed repeatedly to check for pending interrupts. Most of this interrupt checking is unnecessary, however, since interrupts are rare events compared to the execution of translated blocks.
In this paper, we propose a novel and efficient interrupt checking scheme. The key idea is to send the interrupt to the emulated CPU instead of letting pending interrupts wait to be checked. A detailed evaluation is performed with SPEC CPU2000, along with two additional small benchmarks for fast evaluation. The experimental results show about a 30% performance improvement when the block size is limited to 1, a setting commonly used for profiling and debugging. The overall performance improvement is about 2%~3% with normal block sizes. The interrupt latency increases slightly due to the communication, but overall performance is not affected. Finally, through additional experiments, we find that the key to improving performance is removing the branch instruction from the interrupt checking code.
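The inefficiency being removed is easiest to see in code. The sketch below mimics the conventional scheme the abstract describes: every translated block is prefixed with a pending-interrupt check, so the polling branch executes on every block even though it almost never fires. The flag, dispatch loop, and handler are hypothetical simplifications, not the translator's actual implementation.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Pending-interrupt flag set by the emulated devices (possibly from another
 * host thread) and polled by code injected into every translated block. */
static atomic_int pending_irq;

static void handle_irq(void) { puts("deliver interrupt to emulated CPU"); }

/* Conventional scheme: each translated block begins with a check, so the
 * branch runs once per block even though interrupts are rare.  The proposed
 * scheme instead pushes the interrupt to the emulated CPU, allowing this
 * per-block polling (and its branch instruction) to be removed. */
static void execute_translated_block(void (*block)(void)) {
    if (atomic_load(&pending_irq)) {   /* almost always false */
        atomic_store(&pending_irq, 0);
        handle_irq();
    }
    block();                           /* run the translated host code */
}

static void guest_block(void) { puts("guest block body"); }

int main(void) {
    execute_translated_block(guest_block);
    atomic_store(&pending_irq, 1);     /* a device raises an interrupt */
    execute_translated_block(guest_block);
    return 0;
}
```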
Fantastic SSD Internals and How to Learn and Use Them
Nanqinqin Li (University of Chicago and Princeton University); Mingzhe Hao (University of Chicago); Huaicheng Li (University of Chicago and Carnegie Mellon University); Xing Lin, Tim Emami (NetApp); Haryadi S. Gunawi (University of Chicago)
This work presents (a) Queenie, an application-level tool that can automatically learn 10 internal properties of block-level SSDs, (b) Kelpie, the learning and analysis result of running Queenie on 21 different SSD models from 7 major SSD vendors, and (c) Newt, a set of storage performance optimization examples that use the learned properties.
Instant Data Sanitization on Multi-Level-Cell NAND Flash Memory
Md Raquibuzzaman, Matchima Buddhanoy, Aleksandar Milenkovic, Biswajit Ray (The University of Alabama in Huntsville)
Deleting data instantly from NAND flash memories incurs hefty overheads and increases wear. Existing solutions involve unlinking the physical page addresses, making the data inaccessible through standard interfaces, but they carry the risk of data leakage. An all-zero in-place data overwrite has been proposed as a countermeasure, but it applies only to SLC flash memories. This paper introduces an instant page data sanitization method for MLC flash memories that prevents leakage of deleted information without any negative effects on valid data in shared pages. We implement and evaluate the proposed method on commercial 2D and 3D NAND flash memory chips.
O-AFA: Order Preserving All Flash Array
Seung Won Yoo, Joontaek Oh, Youjip Won (KAIST, Korea Advanced Institute of Science and Technology)
The Linux I/O stack interleaves transfers and flushes between I/Os to preserve order; however, this results in poor I/O performance. Some recent research has suggested novel order-preserving mechanisms, but these works are dedicated to a single storage device or to multiple devices in a constrained environment; no prior work provides an order-preserving mechanism that works on arbitrary multiple devices. In this work, we present a new order-preserving mechanism that works on any set of barrier-compliant devices. Three new designs are proposed. First, the cache barrier stripe is employed to preserve order between epochs: WRITE BARRIER commands are dispatched to all disks in the flash array. The basic unit of ordered I/O is the epoch; the order between epochs must be preserved, while the order within an epoch need not be. Second, Epoch in Flash Array is employed to follow the ordering constraint imposed by the filesystem; it instructs the software RAID thread to dispatch the last stripe of an epoch as a cache barrier stripe. Third, shadow-page-aware dispatch is employed, which brings a 19% performance gain compared to incurring the transfer overhead. The combination of these ideas preserves ordering, and leveraging the new order-preserving mechanism brings a 75% performance benefit for varmail and 77% for MySQL.
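A toy model of the epoch-and-barrier idea is sketched below: ordinary stripe writes flow freely within an epoch, and the epoch's last stripe is dispatched as a barrier to every member disk so that no write from the next epoch can become durable before it. The disk count, the issue function, and the epochs are illustrative placeholders, not O-AFA's actual implementation.

```c
#include <stdio.h>

#define NR_DISKS 4

/* Hypothetical per-disk command issue; a real implementation would queue
 * a block request or NVMe command to the member device. */
static void issue_write(int disk, long stripe, int barrier) {
    printf("disk %d: stripe %ld%s\n", disk, stripe,
           barrier ? " [WRITE BARRIER]" : "");
}

/* Dispatch the stripes of one epoch: ordinary writes for all but the last
 * stripe; the closing stripe carries a barrier on every member disk, so the
 * next epoch cannot become durable before this one. */
static void dispatch_epoch(long first_stripe, long nr_stripes) {
    for (long s = 0; s < nr_stripes; s++) {
        int barrier = (s == nr_stripes - 1);
        for (int d = 0; d < NR_DISKS; d++)
            issue_write(d, first_stripe + s, barrier);
    }
}

int main(void) {
    dispatch_epoch(0, 3);   /* epoch 1 */
    dispatch_epoch(3, 2);   /* epoch 2 follows only after the barrier stripe */
    return 0;
}
```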
TACC: A Secure Accelerator Enclave for AI Workloads
Jianping Zhu, Rui Hou, Dan Meng (Institute of Information Engineering, Chinese Academy of Sciences)
We present TACC (Trusted Accelerator), a Secure Accelerator Enclave design that brings heterogeneous accelerators running AI workloads into the protection scope of a Trusted Execution Environment. TACC supports dynamic user switching and context clearing of the accelerator enclave at the microarchitecture level. Physical isolation of in-package memory (3D chip packaging) from off-package memory is used to realize full-stack (hardware-to-software) isolation between the enclave's internal working memory and external ciphertext memory. TACC is also equipped with an independent hardware AES-GCM module (including a DMA engine) responsible for the interaction between internal and external memory. On an FPGA development board containing a Xilinx xc7z100-ffg900-2 chip, we implemented two versions of the TACC prototype: FAT (144 multipliers and 48 blockRAMs) and SLIM (36 multipliers and 12 blockRAMs). We deployed and ran the RepVGG inference neural network on each of them under different batch sizes. The average overhead of our security mechanism is no more than 1.76%.
Efficient Sharing of Linked DMA Channels on Multi-Sensor Devices by LDMA Task Scheduler
You Ren Shen, Bo Yan Huang, Chang Lin Shih, Pai H. Chou (National Tsing Hua University)
Modern microcontroller units (MCUs) support enhanced direct memory access (DMA) such as Linked DMA (LDMA) mechanisms that not only offload bulk I/O from the processor core but also support simple commands to minimize processor intervention between bulk transfers. However, straightforward offloading to dedicated channels results in full occupancy of the channel resources even if the actual I/O load is low. To address this problem, we propose an LDMA task scheduler that schedules groups of tasks to enable their shared access to the same set of channels. When applied to a real-life multi-sensor device, the proposed scheme reduces total channel occupancy from 100% down to 42.8% while incurring minimal processor overhead of 0.1%, thereby enabling offloading of twice as many I/O tasks as the fixed channel allocation scheme.
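The channel-sharing idea can be illustrated with a toy scheduler: rather than dedicating one LDMA channel per sensor, a small pool of channels is handed out only to tasks that currently have data to move. The task list, channel count, and scheduling loop below are hypothetical; the scheduler in the paper additionally handles MCU-specific linked-descriptor details omitted here.

```c
#include <stdio.h>

#define NR_CHANNELS 4
#define NR_TASKS    8

/* Hypothetical I/O task: a sensor transfer that needs an LDMA channel only
 * while it is actively moving data. */
struct ldma_task {
    const char *name;
    int active;             /* 1 while a transfer is pending */
};

/* Minimal scheduler pass: instead of pinning each task to a dedicated
 * channel, hand out the small channel pool to whichever tasks currently
 * have data to move, so idle tasks occupy no channel at all. */
static void schedule(struct ldma_task tasks[], int n) {
    int channel = 0;
    for (int i = 0; i < n && channel < NR_CHANNELS; i++) {
        if (!tasks[i].active)
            continue;       /* idle task: keeps no channel reserved */
        printf("channel %d -> %s\n", channel++, tasks[i].name);
    }
}

int main(void) {
    struct ldma_task tasks[NR_TASKS] = {
        {"imu", 1},  {"mic", 0},  {"ppg", 1},       {"temp", 0},
        {"gyro", 1}, {"uart", 0}, {"spi-flash", 1}, {"adc", 0},
    };
    schedule(tasks, NR_TASKS);
    return 0;
}
```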
Understanding Modern Storage APIs: A systematic study of libaio, SPDK, and io_uring
Diego Didona, Jonas Pfefferle, Nikolas Ioannou, Bernard Metzler (IBM Research Zurich); Animesh Trivedi (VU Amsterdam)
Recent high-performance storage devices have exposed software inefficiencies of existing storage stacks, leading to a new breed of I/O stacks. The newest storage API of the Linux kernel is io_uring. We perform one of the first in-depth studies of io_uring and compare its performance, advantages, and disadvantages with the established libaio and SPDK APIs. Our key findings reveal that (i) the polling design significantly impacts performance; (ii) with enough CPU cores, io_uring can deliver performance close to SPDK; and (iii) performance scalability over multiple CPU cores and devices requires careful consideration and necessitates a hybrid approach. We also provide design guidelines for storage application developers.
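For readers who have not used io_uring, the minimal sketch below submits a single 4 KiB read through liburing, the common userspace wrapper, and waits for its completion. It is only a usage illustration, not the benchmark harness from the paper.

```c
/* build with: cc uring_read.c -luring */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    struct io_uring ring;
    char buf[4096];

    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Initialize a submission/completion queue pair with 8 entries. */
    if (io_uring_queue_init(8, &ring, 0) < 0) { perror("io_uring_queue_init"); return 1; }

    /* Queue one 4 KiB read at offset 0 and submit it to the kernel. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    /* Block until the completion arrives and check its result. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```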
Bulk JPEG Decoding on In-Memory Processors
Joel Nider, Jackson Dagger, Niloo Gharavi, Daniel Ng, Alexandra (Sasha) Fedorova (University of British Columbia)
JPEG is a common encoding format for digital images. Applications that process large numbers of images can be accelerated by decoding multiple images concurrently. We examine the suitability of using a large array of in-memory processors (PIM) to obtain high decoding throughput. The main drawback of PIM processors is that they lack architectural features commonly found on CPUs, such as floating-point units, vector units, and hardware-managed caches. Despite the lack of these features, we demonstrate that it is feasible to build a JPEG decoder for PIM, and we evaluate its quality and potential speedup. We show that the quality of decoded images is sufficient for real applications and that there is significant potential for accelerating image decoding for those applications. We share our experiences in building such a decoder and the challenges we faced while doing so.
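One concrete consequence of missing floating-point hardware is that a decoder's inverse-DCT and color-conversion math must be done in fixed point. The fragment below shows the standard trick of scaling a fractional constant into a Q12 integer and replacing the float multiply with an integer multiply and shift; the constant and format are generic examples, not taken from the authors' decoder.

```c
#include <stdio.h>
#include <stdint.h>

/* PIM cores lack floating-point units, so fractional DCT/color constants are
 * pre-scaled into Q12 integers and applied with a multiply and a shift. */
#define FIX_BITS 12
#define FIX(x)   ((int32_t)((x) * (1 << FIX_BITS) + 0.5))

static int32_t fix_mul(int32_t a, int32_t coeff_q12) {
    return (a * coeff_q12) >> FIX_BITS;   /* integer multiply + shift */
}

int main(void) {
    const int32_t INV_SQRT2 = FIX(0.70710678);   /* 1/sqrt(2) as 2896 in Q12 */
    int32_t sample = 181;                        /* e.g. a DCT coefficient */

    printf("fixed-point result: %d\n", fix_mul(sample, INV_SQRT2));
    printf("floating-point reference: %.2f\n", sample * 0.70710678);
    return 0;
}
```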