[SC'21] DeltaFS: A Scalable No-Ground-Truth Filesystem For Massively-Parallel Computing
1. Motivations
- Global synchronization
- Today's filesystem clients tend to synchronize too frequently with their servers.
- The inadequacy of the current state of the art
- Today's filesystems map all application jobs to a single filesystem namespace.
- Filesystem metadata performance is limited by the amount of dedicated MDS resources.
- From one-size-fits-all to no ground truth
- Today's parallel applications are for the most part non-interactive batch jobs
- These jobs do not necessarily benefit from many of the early network filesystems' semantic obligations.
2. DeltaFS Overview
Main components
- User jobs
- Parallel programs or scripts that are submitted to run on compute nodes.
- Starts by self-defining its filesystem namespace.
- Instantiates DeltaFS client and server instances to serve the namespace.
- At the end, it may release its namespace as a public snapshot searchable and mergeable by other jobs.
- Namespace Registry
- Keepers of all published DeltaFS namespace snapshots.
- Registry daemons run on dedicated server nodes in a computing cluster.
- Each registry is a simple key-value table mapping snapshot names (keys) to pointers to the snapshots' manifest objects (values).
- To enable queries beyond simple snapshot listing, it can be paired with a secondary indexing tier where snapshots are indexed by attributes (ID, filename, create time…).
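The registry described above can be pictured as a flat key-value table. A minimal sketch, assuming illustrative names and fields (the real registry stores manifest pointers into an object store; richer queries go through the secondary index tier):

```python
class NamespaceRegistry:
    """Key: snapshot name. Value: pointer to the snapshot's manifest object."""

    def __init__(self):
        self._table = {}

    def publish(self, snapshot_name, manifest_pointer):
        # Publishing a namespace snapshot is a single KV insert.
        self._table[snapshot_name] = manifest_pointer

    def lookup(self, snapshot_name):
        # Returns the manifest pointer, or None if never published.
        return self._table.get(snapshot_name)

    def list_snapshots(self):
        # The base registry supports only simple listing; queries by
        # attribute (ID, filename, create time) use a secondary index tier.
        return sorted(self._table)
```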
- Compaction Runners
- Parallel compaction jobs dynamically launched by users or compute nodes.
- Compaction is scheduled over a large number of client compute cores on an as-needed basis.
3. No Ground Truth
- DeltaFS does not provide a global filesystem namespace.
- Instead, it records the metadata mutations each job generates as immutable logs.
- Enabling jobs to self-define their namespace consistency avoids unnecessary synchronization in a large computing cluster.
- A log-structured filesystem
- Logs written by one job can be understood by all jobs, making cross-job communication possible.
- Published log (change set) entries can later be used by subsequent jobs for their namespace instantiation, allowing for efficient inter-job data propagation
- Multi-inheritance & Name resolution
- DeltaFS allows complex client namespace views to be efficiently materialized for fast reads through a parallel compaction mechanism.
- Complexity
- Indirect dependencies are automatically resolved and ordered.
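The multi-inheritance idea above can be sketched as replaying parent change sets in dependency order: ancestors are discovered transitively and applied before descendants, with later entries shadowing earlier ones. A hypothetical sketch (all names are illustrative; a postorder DFS stands in for DeltaFS's actual resolution machinery):

```python
def resolve_order(snapshot, parents_of):
    """Return snapshots in replay order: ancestors before descendants."""
    order, seen = [], set()

    def visit(s):
        if s in seen:
            return
        seen.add(s)
        for parent in parents_of.get(s, []):
            visit(parent)  # indirect dependencies resolved transitively
        order.append(s)

    visit(snapshot)
    return order


def materialize(snapshot, parents_of, changes_of):
    """Replay each ancestor's change set; later entries override earlier ones."""
    view = {}
    for s in resolve_order(snapshot, parents_of):
        view.update(changes_of.get(s, {}))
    return view
```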
4. Per-Job Log Management
- DeltaFS uses a log format in which each filesystem metadata mutation is recorded as a KV pair.
- Log management schemes are similar to those of LSM-based KV stores.
5. Dynamic Service Instantiation
- Per-job metadata processing
- Metadata operations are performed by clients sending RPCs to servers.
- Client logging
- Synchronization between DeltaFS clients and servers within a parallel job ensures that files created by one process are immediately visible to all processes in that job.
- Namespace curation
- Fault tolerance, aging, and sequential data sharing
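Per-job metadata processing can be pictured as job-private server instances that all of the job's clients share. A hypothetical sketch in which a direct method call stands in for the RPC: because every client routes operations through the same job server, a file created by one process is immediately visible to the others in that job.

```python
class JobMetadataServer:
    """Job-private metadata server instance (state lives only for the job)."""

    def __init__(self):
        self._namespace = {}

    def create(self, path, attrs):   # handles a client "create" RPC
        self._namespace[path] = attrs

    def lookup(self, path):          # handles a client "lookup" RPC
        return self._namespace.get(path)


class JobClient:
    """DeltaFS client instance embedded in one process of the job."""

    def __init__(self, server):
        self._server = server        # stands in for the server's RPC endpoint

    def create(self, path, attrs):
        self._server.create(path, attrs)

    def lookup(self, path):
        return self._server.lookup(path)
```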
6. Cross-Job Parallel Log Compaction
- Explicitly scheduled by users.
- A user launches cross-job compaction when a complex job change-set hierarchy needs to be flattened for efficient queries.
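The flattening step can be sketched as merging a stack of change sets, oldest first, into one flat table that later jobs can query without walking the hierarchy. A minimal sketch, assuming `None` marks a deletion (a "tombstone", as in LSM compaction); the real mechanism runs this merge in parallel over many compute cores.

```python
def compact(change_sets):
    """Merge change sets (ordered oldest to newest) into one flat table.

    Newer entries shadow older ones; tombstones (value None) drop keys
    from the compacted output entirely.
    """
    merged = {}
    for change_set in change_sets:
        merged.update(change_set)
    # Emit a sorted flat table with tombstones removed.
    return {k: v for k, v in sorted(merged.items()) if v is not None}
```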
7. Why did previous non-dedicated-server architectures fail?
Each filesystem client runs an embedded metadata manager.
- This manager serves both the client and other clients sharing the same filesystem in a LAN.
- To achieve synchronization, distributed locking is used to control access to the shared filesystem and to client data and metadata caches.
Scalability is often an issue due to the large amount of synchronization needed to access the filesystem.
To mitigate this problem, real world deployments typically dedicate a small set of nodes to run filesystem clients with embedded metadata managers.
→ This defeats the goal of having no dedicated managers and fails to utilize compute-node resources.