[SC'21] DeltaFS: A Scalable No-Ground-Truth Filesystem For Massively-Parallel Computing
1. Motivations
- Global synchronization
- Today's filesystem clients tend to synchronize too frequently with their servers.
- The inadequacy of the current state of the art
- Today's filesystems map all application jobs to a single filesystem namespace.
- Filesystem metadata performance is limited by the amount of dedicated MDS resources.
- From one-size-fits-all to no ground truth
- Today's parallel applications are for the most part non-interactive batch jobs
- These jobs do not necessarily benefit from many of the early network filesystems' semantic obligations.
2. DeltaFS Overview
Main components
- User jobs
- Parallel programs or scripts that are submitted to run on compute nodes.
- Starts by self-defining its filesystem namespace.
- Instantiates DeltaFS client and server instances to serve the namespace.
- At the end, it may release its namespace as a public snapshot searchable and mergeable by other jobs.
- Namespace Registry
- Keepers of all published DeltaFS namespace snapshots.
- Registry daemons run on dedicated server nodes in a computing cluster.
- Each registry is a simple key-value table mapping snapshot names (keys) to pointers to the snapshots' manifest objects (values).
- To enable queries beyond simple snapshot listing, it can be paired with a secondary indexing tier where snapshots are indexed by attributes (ID, filename, create time…).
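The registry described above can be pictured as a flat key-value table. A minimal sketch, assuming illustrative names and fields (the real registry stores manifest pointers into an object store; richer queries go through the secondary index tier):

```python
class NamespaceRegistry:
    """Key: snapshot name. Value: pointer to the snapshot's manifest object."""

    def __init__(self):
        self._table = {}

    def publish(self, snapshot_name, manifest_pointer):
        # Publishing a namespace snapshot is a single KV insert.
        self._table[snapshot_name] = manifest_pointer

    def lookup(self, snapshot_name):
        # Returns the manifest pointer, or None if never published.
        return self._table.get(snapshot_name)

    def list_snapshots(self):
        # The base registry supports only simple listing; queries by
        # attribute (ID, filename, create time) use a secondary index tier.
        return sorted(self._table)
```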
- Compaction Runners
- Parallel compaction jobs dynamically launched by users or compute nodes.
- Compaction is scheduled over a large number of client compute cores on an as-needed basis.
3. No Ground Truth
- DeltaFS does not provide a global filesystem namespace.
- Instead, it records the metadata mutations each job generates as immutable logs.
- Enabling jobs to self-define their namespace consistency avoids unnecessary synchronization in a large computing cluster.
- A log-structured filesystem
- Logs written by one job can be understood by all jobs, making cross-job communication possible.
- Published log (change set) entries can later be used by subsequent jobs for their namespace instantiation, allowing for efficient inter-job data propagation
- Multi-inheritance & Name resolution
- DeltaFS allows complex client namespace views to be efficiently materialized for fast reads through a parallel compaction mechanism.
- Complexity
- Indirect dependencies are automatically resolved and ordered.
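The multi-inheritance idea above can be sketched as replaying parent change sets in dependency order: ancestors are discovered transitively and applied before descendants, with later entries shadowing earlier ones. A hypothetical sketch (all names are illustrative; a postorder DFS stands in for DeltaFS's actual resolution machinery):

```python
def resolve_order(snapshot, parents_of):
    """Return snapshots in replay order: ancestors before descendants."""
    order, seen = [], set()

    def visit(s):
        if s in seen:
            return
        seen.add(s)
        for parent in parents_of.get(s, []):
            visit(parent)  # indirect dependencies resolved transitively
        order.append(s)

    visit(snapshot)
    return order


def materialize(snapshot, parents_of, changes_of):
    """Replay each ancestor's change set; later entries override earlier ones."""
    view = {}
    for s in resolve_order(snapshot, parents_of):
        view.update(changes_of.get(s, {}))
    return view
```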
4. Per-Job Log Management
- DeltaFS uses a log format in which each filesystem metadata mutation is recorded as a KV pair.
- Log management schemes are similar to those of LSM-based KV stores.
5. Dynamic Service Instantiation
- Per-job metadata processing
- Metadata operations are performed by clients sending RPCs to servers.
- Client logging
- Synchronization between DeltaFS clients and servers within a parallel job ensures that files created by one process are immediately visible to all processes in that job.
- Namespace curation
- Fault tolerance, aging, and sequential data sharing
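Per-job metadata processing can be pictured as job-private server instances that all of the job's clients share. A hypothetical sketch in which a direct method call stands in for the RPC: because every client routes operations through the same job server, a file created by one process is immediately visible to the others in that job.

```python
class JobMetadataServer:
    """Job-private metadata server instance (state lives only for the job)."""

    def __init__(self):
        self._namespace = {}

    def create(self, path, attrs):   # handles a client "create" RPC
        self._namespace[path] = attrs

    def lookup(self, path):          # handles a client "lookup" RPC
        return self._namespace.get(path)


class JobClient:
    """DeltaFS client instance embedded in one process of the job."""

    def __init__(self, server):
        self._server = server        # stands in for the server's RPC endpoint

    def create(self, path, attrs):
        self._server.create(path, attrs)

    def lookup(self, path):
        return self._server.lookup(path)
```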
6. Cross-Job Parallel Log Compaction
- Explicitly scheduled by users.
- A user launches cross-job compaction when a complex job change-set hierarchy needs to be flattened for efficient queries.
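The flattening step can be sketched as merging a stack of change sets, oldest first, into one flat table that later jobs can query without walking the hierarchy. A minimal sketch, assuming `None` marks a deletion (a "tombstone", as in LSM compaction); the real mechanism runs this merge in parallel over many compute cores.

```python
def compact(change_sets):
    """Merge change sets (ordered oldest to newest) into one flat table.

    Newer entries shadow older ones; tombstones (value None) drop keys
    from the compacted output entirely.
    """
    merged = {}
    for change_set in change_sets:
        merged.update(change_set)
    # Emit a sorted flat table with tombstones removed.
    return {k: v for k, v in sorted(merged.items()) if v is not None}
```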
7. Why did previous non-dedicated-server architectures fail?
Each filesystem client runs an embedded metadata manager.
- This manager serves both the client and other clients sharing the same filesystem in a LAN.
- To achieve synchronization, distributed locking is used to control access to the shared filesystem and to client data and metadata caches.
Scalability is often an issue due to the large amount of synchronization needed to access the filesystem.
To mitigate this problem, real world deployments typically dedicate a small set of nodes to run filesystem clients with embedded metadata managers.
→ This defeats the goal of having no dedicated managers and fails to utilize compute-node resources.