Paper Study

[SC'21] DeltaFS: A Scalable No-Ground-Truth Filesystem For Massively-Parallel Computing

by shiny_sneakers 2022. 9. 6.


1. Motivations

  • Global synchronization
    • Today's filesystem clients tend to synchronize too frequently with their servers.
  • The inadequacy of the current state of the art
    • Today's filesystems map all application jobs to a single filesystem namespace.
    • Filesystem metadata performance is limited by the amount of dedicated MDS resources.
  • From one-size-fits-all to no ground truth
    • Today's parallel applications are for the most part non-interactive batch jobs
      • that do not necessarily benefit from many of the early network filesystems' semantic obligations.

2. DeltaFS Overview

  • Main components

    • User jobs
      • Parallel programs or scripts submitted to run on compute nodes.
      1. A job starts by self-defining its filesystem namespace.
      2. It instantiates DeltaFS client and server instances to serve that namespace.
      3. At the end, it may publish its namespace as a public snapshot that other jobs can search and merge.
    • Namespace Registry
      • Keepers of all published DeltaFS namespace snapshots.
      • Registry daemons run on dedicated server nodes in a computing cluster.
      • Each registry is a simple key-value table mapping snapshot names to pointers to the snapshots' manifest objects.
      • To enable queries beyond simple snapshot listing, it can be paired with a secondary indexing tier where snapshots are indexed by attributes (ID, filename, create time…).
    • Compaction Runners
      • Parallel compaction jobs dynamically launched by users or compute nodes.
      • Compaction is scheduled over a large number of client compute cores on an as-needed basis.
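The registry's key-value mapping can be sketched as follows. This is only an illustration of the table described above: the function names, the pointer format, and the choice of "owner" as a secondary-index attribute are assumptions, not details from the paper.

```python
# Hypothetical sketch of the namespace registry: a key-value table
# mapping snapshot names to manifest-object pointers, plus an optional
# secondary index over a snapshot attribute (here, the owner).

registry = {}   # snapshot name -> pointer to the snapshot's manifest object
by_owner = {}   # secondary index: owner -> set of snapshot names

def publish(name, manifest_ptr, owner):
    """Register a published snapshot and index it by owner."""
    registry[name] = manifest_ptr
    by_owner.setdefault(owner, set()).add(name)

def lookup(name):
    """Resolve a snapshot name to its manifest pointer, or None."""
    return registry.get(name)

publish("job42-output", "obj://manifests/42", owner="alice")
assert lookup("job42-output") == "obj://manifests/42"
assert "job42-output" in by_owner["alice"]
```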

3. No Ground Truth

  • DeltaFS does not provide a global filesystem namespace.
    • Instead, it records the metadata mutations each job generates as immutable logs.
    • Enabling jobs to self-define their namespace consistency avoids unnecessary synchronization in a large computing cluster.
  • A log-structured filesystem
    • Logs written by one job can be understood by all jobs, making cross-job communication possible.
    • Published log (change set) entries can later be used by subsequent jobs for their namespace instantiation, allowing for efficient inter-job data propagation.
  • Multi-inheritance & Name resolution
    • DeltaFS allows complex client namespace views to be efficiently materialized for fast reads through a parallel compaction mechanism.
  • Complexity
    • Indirect dependencies are automatically resolved and ordered.
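The inheritance idea above can be sketched as replaying parent change sets in dependency order, with later mutations shadowing earlier ones. The record layout (path-keyed entries, `None` as a deletion marker) is an assumption for illustration, not the paper's actual log format.

```python
# Illustrative sketch: a job materializes its namespace view by replaying
# the change sets of the snapshots it inherits from, in dependency order.
# A value of None marks a deletion record.

def materialize(change_sets):
    """change_sets: list of ordered (path, attrs_or_None) mutation lists,
    parents first, then this job's own log."""
    view = {}
    for cs in change_sets:
        for path, attrs in cs:
            if attrs is None:
                view.pop(path, None)   # deletion shadows earlier creates
            else:
                view[path] = attrs     # newest mutation wins
    return view

parent_a = [("/data/x", {"ino": 1}), ("/data/y", {"ino": 2})]
parent_b = [("/data/y", None)]         # a later job deleted /data/y
assert materialize([parent_a, parent_b]) == {"/data/x": {"ino": 1}}
```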

4. Per-Job Log Management

  • DeltaFS uses a log format in which each filesystem metadata mutation is recorded as a KV pair.
  • Log management schemes are similar to those of LSM-tree-based KV stores.
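A minimal sketch of this KV log format, with lookups scanning log runs newest-first as an LSM-tree-based KV store would. The key layout (parent directory ID, filename) and attribute fields are assumptions for illustration.

```python
# Sketch: each metadata mutation is one KV record, keyed by
# (parent directory ID, filename); reads check the newest log run first,
# mirroring how LSM-based KV stores resolve overlapping runs.

logs = []  # list of flushed log runs, oldest first

def record(log, parent_id, name, attrs):
    """Append one metadata mutation (e.g. a create or update) to a run."""
    log[(parent_id, name)] = attrs

def lookup(parent_id, name):
    """Scan runs newest-first; the most recent mutation wins."""
    for log in reversed(logs):
        if (parent_id, name) in log:
            return log[(parent_id, name)]
    return None

run1, run2 = {}, {}
record(run1, 0, "a.txt", {"mode": 0o644, "size": 0})
record(run2, 0, "a.txt", {"mode": 0o644, "size": 4096})  # later update
logs.extend([run1, run2])
assert lookup(0, "a.txt")["size"] == 4096
```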

5. Dynamic Service Instantiation

  • Per-job metadata processing
    • Metadata operations are performed by clients sending RPCs to servers.
  • Client logging
    • Synchronization between DeltaFS clients and servers within a parallel job ensures that files created by one process are immediately visible to all processes in that job.
  • Namespace curation
  • Fault tolerance, Aging, and Sequential data sharing
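The per-job metadata processing above can be sketched as a job instantiating its own server instances and routing each operation to one of them, so a file created by one process is visible to all processes in the job. The sharding-by-hash scheme and all names below are illustrative assumptions, and the function calls stand in for RPCs.

```python
# Sketch of per-job dynamic service instantiation: a job runs N server
# instances on its own compute nodes; clients route each metadata
# operation to a server by hashing the key, so every process observes
# the same state for a given file.

NUM_SERVERS = 4
servers = [dict() for _ in range(NUM_SERVERS)]   # per-server namespace shard

def route(parent_id, name):
    """Pick the server instance responsible for this key."""
    return hash((parent_id, name)) % NUM_SERVERS

def create(parent_id, name, attrs):
    """Stand-in for a client RPC that creates a file."""
    servers[route(parent_id, name)][(parent_id, name)] = attrs

def stat(parent_id, name):
    """Any process in the job resolves the same server, so the create
    is immediately visible job-wide."""
    return servers[route(parent_id, name)].get((parent_id, name))

create(0, "result.bin", {"size": 128})
assert stat(0, "result.bin") == {"size": 128}
```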

6. Cross-Job Parallel Log Compaction

  • Explicitly scheduled by users.
  • A user launches cross-job compaction when a complex job change-set hierarchy needs to be flattened for efficient queries.
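Flattening a change-set hierarchy can be sketched as merging several log runs into one sorted run, keeping only the newest record per key, as in LSM-style compaction. The record layout below is assumed for illustration; the paper's actual on-disk format and parallel scheduling are not shown.

```python
# Sketch of a (single-core slice of) cross-job compaction: fold a stack
# of log runs, oldest first, into one flattened sorted run in which the
# newest record for each key wins.

def compact(runs):
    """runs: list of (key, value) lists, oldest run first."""
    merged = {}
    for run in runs:          # later runs overwrite earlier ones
        for key, value in run:
            merged[key] = value
    return sorted(merged.items())   # one flattened, query-friendly run

old = [("/a", 1), ("/b", 2)]
new = [("/b", 3)]                   # a later job updated /b
assert compact([old, new]) == [("/a", 1), ("/b", 3)]
```

In DeltaFS this work is fanned out over many client compute cores; the sketch covers only the merge rule itself.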

7. Why Did Previous Non-Dedicated-Server Architectures Fail?

  • Each filesystem client runs an embedded metadata manager.

    • This manager serves both the client and other clients sharing the same filesystem in a LAN.
    • To achieve synchronization, distributed locking is used to control access to the shared filesystem and to client data and metadata caches.
  • Scalability is often an issue due to the large amount of synchronization needed to access the filesystem.

    • To mitigate this problem, real world deployments typically dedicate a small set of nodes to run filesystem clients with embedded metadata managers.

      → This defeats the goal of having no dedicated managers and fails to utilize compute node resources.
