Thomas Kim

Design principles for replicated storage systems built on emerging storage technologies

Degree Type: Ph.D. in Computer Science
Advisor(s): David G. Andersen
Graduated: May 2023

Abstract:
With the slowing of Moore's law, persistent storage hardware has continued to scale, but at the cost of exposing hardware-level write idiosyncrasies to software. A key challenge for systems developers is therefore to reason about and design around these idiosyncrasies to create replicated storage systems that can effectively leverage these new technologies. Two examples of such new and emerging persistent storage technologies are Intel Optane non-volatile main memory and Zoned Namespace (ZNS) solid-state drives. Intel Optane provides persistent byte-addressable storage with throughput and latency rivaling DRAM, but its per-DIMM write throughput is significantly lower than its read throughput. This imbalance makes it difficult to provide high availability in replicated storage systems because the ability to bulk-ingest data is severely limited. ZNS is a new interface for NVMe-based SSDs that eliminates the flash translation layer, preventing garbage-collection-related performance degradation and reducing the need for overprovisioned flash hardware. A consequence of these benefits is the loss of overwrite semantics for blocks on a ZNS device, so flash-based replicated storage systems must be redesigned for ZNS compatibility.

Based on our experiences and setbacks when designing, implementing, and evaluating systems built on Optane and ZNS, we propose three guidelines to assist developers in designing storage systems on new and emerging persistent storage technologies: (1) systems, even those expected to serve read-heavy workloads, should prioritize optimizing write performance; (2) systems should set and fulfill performance, durability, and fault-tolerance guarantees, but not exceed them, as doing so can incur excessive write overheads; and (3) systems can overcome the limitations of write-constrained persistent hardware by optimizing data placement and internal data flows based on assumptions about the temporal and spatial locality of the expected client workload.

The first system we present is CANDStore, a highly available, cost-effective, replicated key-value store that uses Intel Optane for primary storage and solves the challenge of bottlenecked data ingestion during primary failure recovery through a novel online, workload-guided recovery protocol. The second system we present is RAIZN, which provides RAID-like striping and redundancy across arrays of ZNS SSDs and addresses the challenges that arise from the lack of overwrite semantics in ZNS. We describe how the guidelines above arose from the setbacks and successes during the development of these two systems, then apply them to extend the functionality of RAIZN, creating RAIZN+. The final part of this thesis details how we applied these guidelines to achieve near-zero write amplification when serving RocksDB workloads on RAIZN+.

Thesis Committee:
David G. Andersen (Chair)
Michael Kaminsky
Gregory R. Ganger
Matias Bjørling (Western Digital)

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

Keywords: Storage systems, replicated storage, distributed storage, persistent memory, non-volatile main memory, zoned namespaces, ZNS, SSD, flash memory

CMU-CS-23-109.pdf (47.8 MB) (117 pages)