The Filesystem

Advanced

Storage | 10 tasks

GFS and HDFS showed the world how to store petabytes across thousands of cheap machines. Build a tiny distributed filesystem with chunk servers, replication, and master failover.

Subtracks & Tasks

Distributed File Storage

DI-1

advanced

Design a GFS-Style Distributed File System Architecture

The Google File System (GFS) architecture is the foundation of modern distributed storage. It separates metadata (managed by a master) from data (stor...

GFS architecture, master node, chunk server, +2 more

DI-2

advanced

Implement the Master Namespace Tree

The master's namespace is a hierarchical tree of directories and files. It maps every file to its chunks and their locations. This is stored entirely ...

namespace tree, directory hierarchy, chunk mapping, +2 more
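As a rough illustration of the mapping described above (path to chunks, chunks to locations), the sketch below keeps a tree of directory nodes whose file leaves hold an ordered list of chunk handles. It is not the track's reference solution: the names Namespace, mkdir, create_file, add_chunk, and lookup are hypothetical, and persistence is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A directory (uses `children`) or a file (uses `chunks`)."""
    is_dir: bool
    children: dict = field(default_factory=dict)  # name -> Node
    chunks: list = field(default_factory=list)    # ordered chunk handles


class Namespace:
    """Hypothetical in-memory namespace: path -> chunks -> locations."""

    def __init__(self):
        self.root = Node(is_dir=True)
        self.locations = {}     # chunk handle -> set of chunk server ids
        self._next_handle = 0

    @staticmethod
    def _split(path):
        return [part for part in path.split("/") if part]

    def _walk(self, parts):
        node = self.root
        for name in parts:
            if not node.is_dir or name not in node.children:
                raise FileNotFoundError("/" + "/".join(parts))
            node = node.children[name]
        return node

    def _new_entry(self, path, is_dir):
        *parent, name = self._split(path)
        parent_node = self._walk(parent)
        if not parent_node.is_dir:
            raise NotADirectoryError(path)
        if name in parent_node.children:
            raise FileExistsError(path)
        parent_node.children[name] = Node(is_dir=is_dir)

    def mkdir(self, path):
        self._new_entry(path, is_dir=True)

    def create_file(self, path):
        self._new_entry(path, is_dir=False)

    def add_chunk(self, path, servers):
        """Append a new chunk to a file and record where its replicas live."""
        node = self._walk(self._split(path))
        if node.is_dir:
            raise IsADirectoryError(path)
        handle = self._next_handle
        self._next_handle += 1
        node.chunks.append(handle)
        self.locations[handle] = set(servers)
        return handle

    def lookup(self, path):
        """Return [(chunk handle, sorted server ids)] in file order."""
        node = self._walk(self._split(path))
        return [(h, sorted(self.locations[h])) for h in node.chunks]
```

A client read would call lookup once, then contact the listed chunk servers directly for the data.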

DI-3

advanced

Implement Chunk Creation and Allocation

When a client creates a file or appends a new chunk, the master must allocate chunk storage on appropriate chunk servers. Chunk creation flow: 1. Cli...

chunk allocation, placement policy, rack awareness, +1 more
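One possible placement policy, sketched under assumptions: the master prefers the emptiest servers and spreads replicas across racks before reusing a rack. The Server record, the disk_used field, and place_chunk are hypothetical names, and real policies usually weigh more signals than these two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    server_id: str
    rack: str
    disk_used: float  # fraction of disk in use, 0.0 to 1.0


def place_chunk(servers, replicas=3):
    """Pick `replicas` servers, emptiest first, spreading across racks.

    The first pass takes at most one server per rack. A second pass fills
    any remaining slots when there are fewer racks than replicas.
    """
    by_usage = sorted(servers, key=lambda s: (s.disk_used, s.server_id))
    chosen, racks_used = [], set()
    for server in by_usage:
        if len(chosen) == replicas:
            break
        if server.rack not in racks_used:
            chosen.append(server)
            racks_used.add(server.rack)
    for server in by_usage:
        if len(chosen) == replicas:
            break
        if server not in chosen:
            chosen.append(server)
    if len(chosen) < replicas:
        raise RuntimeError("not enough chunk servers for requested replication")
    return [server.server_id for server in chosen]
```

Taking one server per rack first means a single rack failure cannot remove every replica of a chunk, as long as at least two racks exist.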

DI-4

advanced

Implement Chunk Replication with Pipeline Writes

When a client writes data, the primary chunk server coordinates replication to all secondaries. GFS uses a **pipeline** design where data flows in a c...

chunk replication, pipeline writes, primary-secondary, +2 more

DI-5

advanced

Implement Chunk Leases for Primary Assignment

A chunk lease grants one chunk server the exclusive right to define the mutation order for a chunk. This avoids per-operation consensus while maintain...

lease, primary election, lease renewal, +2 more
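A minimal sketch of master-side lease bookkeeping, assuming the caller passes the current time explicitly so the logic stays testable. LeaseTable and its methods are hypothetical names. The 60-second default matches the initial lease timeout described in the GFS paper; everything else is an illustration rather than a prescribed design.

```python
class LeaseTable:
    """Hypothetical master-side lease table; `now` values are in seconds."""

    def __init__(self, duration=60.0):
        self.duration = duration
        self._leases = {}  # chunk handle -> (primary server id, expiry time)

    def primary(self, chunk, now):
        """Return the current primary, or None if no unexpired lease exists."""
        lease = self._leases.get(chunk)
        if lease is None or now >= lease[1]:
            return None
        return lease[0]

    def grant(self, chunk, server, now):
        """Grant a lease unless another server still holds an unexpired one."""
        holder = self.primary(chunk, now)
        if holder is not None and holder != server:
            return False
        self._leases[chunk] = (server, now + self.duration)
        return True

    def renew(self, chunk, server, now):
        """Extend the lease, but only for the current, unexpired holder."""
        if self.primary(chunk, now) != server:
            return False
        self._leases[chunk] = (server, now + self.duration)
        return True
```

Because a new primary is only granted after the old lease has expired, at most one server orders mutations for a chunk at any moment, provided clocks are reasonably synchronized.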

Fault Tolerance and Rebalancing

FA-1

intermediate

Implement Chunk Server Heartbeats

Chunk server heartbeats are the master's only mechanism for tracking which servers are alive and which chunks they hold. Without heartbeats, the maste...

heartbeat, chunk server monitoring, liveness detection, +1 more
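The liveness tracking described above could look roughly like this. HeartbeatMonitor, its method names, and the 10-second timeout are all assumptions chosen for illustration; the clock is passed in rather than read from the system so the behavior is deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical master-side view of chunk server liveness and inventory."""

    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_seen = {}  # server id -> time of last heartbeat
        self.inventory = {}  # server id -> set of chunk handles it reported

    def heartbeat(self, server, chunks, now):
        """Record a heartbeat carrying the server's full chunk inventory."""
        self.last_seen[server] = now
        self.inventory[server] = set(chunks)

    def live_servers(self, now):
        return sorted(
            s for s, seen in self.last_seen.items() if now - seen <= self.timeout
        )

    def reap(self, now):
        """Forget servers silent for longer than the timeout.

        Returns {server id: chunk handles it held} so the caller can
        schedule re-replication for those chunks.
        """
        dead = [s for s, seen in self.last_seen.items() if now - seen > self.timeout]
        lost = {}
        for server in dead:
            del self.last_seen[server]
            lost[server] = self.inventory.pop(server)
        return lost
```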

FA-2

advanced

Implement Automatic Re-Replication

When a chunk server dies, its chunks become under-replicated. The master must automatically schedule re-replication to restore the target replication ...

re-replication, under-replicated chunks, replication factor, +1 more
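One way to plan the repair work, sketched under assumptions: chunks with the fewest surviving replicas are handled first, and each copy goes to the live server currently holding the fewest chunks. plan_rereplication and its return shape are hypothetical, and chunk counts stand in for real load metrics.

```python
def plan_rereplication(chunk_locations, live_servers, target=3):
    """Return (tasks, lost).

    tasks: [(chunk, source, destination)] copy tasks, most urgent first.
    lost:  chunks with no surviving replica, which cannot be repaired.
    """
    live = set(live_servers)
    load = {server: 0 for server in live}
    survivors = {}
    for chunk, holders in chunk_locations.items():
        alive = set(holders) & live
        survivors[chunk] = alive
        for server in alive:
            load[server] += 1

    lost = sorted(c for c, alive in survivors.items() if not alive)
    under = sorted(
        (c for c, alive in survivors.items() if 0 < len(alive) < target),
        key=lambda c: (len(survivors[c]), c),  # fewest replicas first
    )

    tasks = []
    for chunk in under:
        source = min(survivors[chunk])  # only a surviving replica can be a source
        holders = set(survivors[chunk])
        while len(holders) < target:
            candidates = sorted(live - holders, key=lambda s: (load[s], s))
            if not candidates:
                break  # fewer distinct live servers than the target
            destination = candidates[0]
            tasks.append((chunk, source, destination))
            holders.add(destination)
            load[destination] += 1
    return tasks, lost
```

Prioritizing chunks that are furthest from their replication goal reduces the window in which one more failure would lose data.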

FA-3

advanced

Implement Chunk Server Load Balancing

Over time, chunk distribution becomes uneven: new servers start empty, old servers fill up, and some receive more writes. Load balancing moves chunks ...

load balancing, chunk migration, disk utilization, +1 more
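A deliberately simple rebalancing sketch: move one chunk at a time from the fullest server to the emptiest until the gap falls within a threshold. It assumes chunk counts are an adequate proxy for disk utilization, which a real balancer may not accept. plan_rebalance and its threshold parameter are hypothetical.

```python
def plan_rebalance(chunks_per_server, threshold=2):
    """Return (moves, final_counts).

    moves: [(source, destination)], each moving a single chunk.
    Stops once the fullest and emptiest servers differ by at most `threshold`.
    """
    if threshold < 1:
        # A gap of exactly one chunk can never be closed, so smaller
        # thresholds would make the loop move chunks back and forth forever.
        raise ValueError("threshold must be at least 1")
    counts = dict(chunks_per_server)
    moves = []
    while len(counts) > 1:
        fullest = max(counts, key=lambda s: (counts[s], s))
        emptiest = min(counts, key=lambda s: (counts[s], s))
        if counts[fullest] - counts[emptiest] <= threshold:
            break
        counts[fullest] -= 1
        counts[emptiest] += 1
        moves.append((fullest, emptiest))
    return moves, counts
```

The threshold keeps the balancer from churning over small imbalances, since every move costs network and disk bandwidth.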

FA-4

advanced

Implement Master Failover with Shadow Master

The master is a single point of failure. A shadow master mitigates this by continuously replaying the primary's WAL, staying nearly synchronized. Fai...

master failover, shadow master, WAL replay, +2 more
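The WAL replay idea can be sketched as follows, assuming the log is a list of (sequence number, operation, path, argument) tuples with consecutive sequence numbers starting at 1. That record format, the three operations, and the class names are invented for illustration; the track may define its own.

```python
class MasterState:
    """Hypothetical metadata state rebuilt purely from log records."""

    def __init__(self):
        self.files = {}   # path -> list of chunk handles
        self.applied = 0  # sequence number of the last applied record

    def apply(self, record):
        seq, op, path, arg = record
        if seq != self.applied + 1:
            raise ValueError(f"gap in log: expected {self.applied + 1}, got {seq}")
        if op == "create":
            self.files[path] = []
        elif op == "add_chunk":
            self.files[path].append(arg)
        elif op == "delete":
            del self.files[path]
        else:
            raise ValueError(f"unknown op {op!r}")
        self.applied = seq


class ShadowMaster:
    """Follows the primary by replaying records it has not yet applied."""

    def __init__(self):
        self.state = MasterState()

    def catch_up(self, wal):
        """Apply every record past the last applied one; return how many."""
        pending = [r for r in wal if r[0] > self.state.applied]
        for record in pending:
            self.state.apply(record)
        return len(pending)

    def lag(self, wal):
        """Number of records the shadow is behind the end of the log."""
        newest = wal[-1][0] if wal else 0
        return newest - self.state.applied
```

The lag value is what determines how much replay remains at failover time: a shadow that stays close to zero can take over almost immediately.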

FA-5

intermediate

Implement Chunk Checksums for Data Integrity

Disks can silently corrupt data without any error signal. Chunk checksums detect this corruption before it is returned to users. Checksum design: 1. ...

checksum, data integrity, corruption detection, +2 more
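A small sketch of per-block checksumming, assuming CRC32 from Python's zlib module as the checksum function. The GFS paper describes 64 KB blocks with a 32-bit checksum each; the choice of CRC32 specifically, and the names checksum_blocks, verified_read, and CorruptBlock, are assumptions made here.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, the granularity described for GFS


class CorruptBlock(Exception):
    """Raised with the index of the first block that fails verification."""


def checksum_blocks(data, block_size=BLOCK_SIZE):
    """Compute one CRC32 per fixed-size block of a chunk."""
    return [
        zlib.crc32(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def verified_read(data, checksums, offset, length, block_size=BLOCK_SIZE):
    """Return data[offset:offset+length] after verifying every block touched."""
    end = min(offset + length, len(data))
    if offset >= end:
        return b""
    first = offset // block_size
    last = (end - 1) // block_size
    for index in range(first, last + 1):
        block = data[index * block_size:(index + 1) * block_size]
        if index >= len(checksums) or zlib.crc32(block) != checksums[index]:
            raise CorruptBlock(index)
    return data[offset:end]
```

Verifying only the blocks a read overlaps keeps the cost proportional to the read size rather than the chunk size. On a CorruptBlock error, a chunk server would typically report the bad replica so the data can be served and restored from another copy.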

Interview Prep

Common interview questions for Infrastructure / Storage Engineer roles that map directly to what you build in this track.

Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.

Common Mistakes

The top 5 mistakes builders make in this track, with the root cause and the correct approach for each.

Comparison Mode

Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track.

Concepts Covered

GFS architecture, master node, chunk server, 64 MB chunks, replication factor, namespace tree, directory hierarchy, chunk mapping, metadata, WAL-backed, chunk allocation, placement policy, rack awareness, primary assignment, chunk replication, pipeline writes, primary-secondary, write acknowledgement, data flow, lease, primary election, lease renewal, lease expiry, consistency window, heartbeat, chunk server monitoring, liveness detection, chunk inventory, re-replication, under-replicated chunks, failure recovery, load balancing, chunk migration, disk utilization, rebalancing threshold, master failover, shadow master, WAL replay, hot standby, failover time, checksum, data integrity, corruption detection, per-block checksum, silent corruption

Prerequisites

It is recommended to complete the previous tracks before starting this one. Concepts build progressively throughout the curriculum.

Rabbit Holes

For when you want to go deeper. Curated papers, posts, and talks beyond what this track covers.

Paper

The Google File System

The 2003 SOSP paper describing GFS, the distributed file system that directly inspired HDFS and shaped many of the large-scale storage systems that followed. Refreshingly honest about the trade-offs made for Google's specific workload.

Paper

The Hadoop Distributed File System

HDFS was designed as an open-source implementation of GFS. This paper covers the key differences — stronger consistency model, the NameNode design — and the operational experience from running HDFS at Yahoo scale.

Paper

Ceph: A Scalable, High-Performance Distributed File System

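The Ceph paper from OSDI 2006, which introduces CRUSH (Controlled Replication Under Scalable Hashing) for data placement. Clients compute where data lives with CRUSH instead of asking a central master for chunk locations, which removes the GFS-style location lookup from the data path. Ceph still runs a metadata server cluster for file namespace operations, so the innovation is the elimination of allocation tables rather than of metadata servers.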