Subtracks & Tasks
Centralized Job Scheduling
Implement Centralized Job Scheduler
A centralized scheduler is the single authority that receives all job submissions, maintains a priority queue, and dispatches work to available worker...
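As a rough illustration of the description above, the sketch below keeps one priority queue and hands jobs to idle workers. The class name `CentralScheduler`, its methods, and the convention that a lower number means higher priority are assumptions made for this example, not part of the track's reference solution.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class _Entry:
    priority: int                      # lower value = dispatched first (assumed convention)
    seq: int                           # submission order, breaks ties FIFO
    job_id: str = field(compare=False)


class CentralScheduler:
    """Single authority: accepts submissions, queues by priority, dispatches to idle workers."""

    def __init__(self, workers):
        self._idle = list(workers)     # workers with no job assigned
        self._queue = []               # min-heap of _Entry
        self._seq = itertools.count()
        self.assignments = {}          # job_id -> worker

    def submit(self, job_id, priority):
        heapq.heappush(self._queue, _Entry(priority, next(self._seq), job_id))

    def dispatch(self):
        """Assign queued jobs to idle workers until either runs out."""
        dispatched = []
        while self._queue and self._idle:
            entry = heapq.heappop(self._queue)
            worker = self._idle.pop(0)
            self.assignments[entry.job_id] = worker
            dispatched.append((entry.job_id, worker))
        return dispatched

    def complete(self, job_id):
        """Mark a job finished and return its worker to the idle pool."""
        self._idle.append(self.assignments.pop(job_id))
```

With two workers and three submitted jobs, `dispatch()` assigns the two highest-priority jobs first; the third waits until `complete()` frees a worker.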
Implement Deadlock Prevention in Scheduling
Deadlock happens when two jobs each hold a resource the other needs, so neither can proceed. Prevention is better than detection: refuse any allocatio...
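One possible refusal rule, assumed here purely for illustration, is to reject any request that would close a cycle in the wait-for graph. The `ResourceManager` class and its return values are hypothetical, and the sketch assumes each job waits for at most one resource at a time.

```python
class ResourceManager:
    """Refuses a request if letting the job wait would create a wait-for cycle."""

    def __init__(self):
        self.holder = {}    # resource -> job currently holding it
        self.waiting = {}   # job -> the single resource it is waiting for

    def _would_cycle(self, job, resource):
        # Walk the chain: resource -> its holder -> what that holder waits for -> ...
        current = self.holder.get(resource)
        while current is not None:
            if current == job:
                return True
            next_resource = self.waiting.get(current)
            if next_resource is None:
                return False
            current = self.holder.get(next_resource)
        return False

    def request(self, job, resource):
        """Return 'granted', 'waiting', or 'refused'."""
        if resource not in self.holder or self.holder[resource] == job:
            self.holder[resource] = job
            return "granted"
        if self._would_cycle(job, resource):
            return "refused"            # waiting here would deadlock, so say no
        self.waiting[job] = resource
        return "waiting"
```

In the two-job case from the text, job A holding `r1` may wait for `r2`, but job B holding `r2` is then refused when it asks for `r1`.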
Implement Fair Job Scheduling
Pure priority scheduling causes starvation: low-priority jobs may wait forever if high-priority jobs keep arriving. Fair scheduling prevents this thro...
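A common remedy for starvation is priority aging, where a job's effective priority improves the longer it waits. The sketch below assumes that technique; `pick_next`, the tuple layout, and `aging_rate` are illustrative names, and lower values are assumed to run first.

```python
def pick_next(jobs, now, aging_rate=1.0):
    """Pick the job with the best effective priority.

    jobs: list of (job_id, base_priority, submitted_at) tuples.
    Lower priority values run first; waiting time lowers the effective value.
    """
    def effective(job):
        _, base_priority, submitted_at = job
        return base_priority - aging_rate * (now - submitted_at)

    return min(jobs, key=effective)[0]
```

With `aging_rate=1.0`, a priority-10 job that has waited 100 time units beats a freshly submitted priority-1 job, so steady high-priority traffic cannot block it forever.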
Implement Dependency-Aware Job Scheduling
Some jobs can only start after others finish. Dependency-aware scheduling builds an execution plan that respects these constraints while maximising pa...
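One way to build such a plan is to group jobs into "waves" with Kahn-style topological sorting: every job in a wave has all its dependencies satisfied by earlier waves, so a wave can run in parallel. The function name `plan_waves` and the input shape are assumptions for this sketch.

```python
def plan_waves(deps):
    """Group jobs into waves that can each run in parallel.

    deps: dict mapping job -> set of jobs it depends on.
    Returns a list of waves; each wave is a sorted list of job names.
    Raises ValueError if the dependencies contain a cycle.
    """
    remaining = {job: set(d) for job, d in deps.items()}
    # Jobs that appear only as dependencies still need an entry.
    for d in deps.values():
        for dep in d:
            remaining.setdefault(dep, set())

    waves = []
    while remaining:
        ready = sorted(job for job, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for job in ready:
            del remaining[job]
        for d in remaining.values():
            d.difference_update(ready)
    return waves
```

For example, if `deploy` depends on `test` and `lint`, and `test` depends on `build`, the plan runs `build` and `lint` together, then `test`, then `deploy`.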
Implement Resource Estimation and Provisioning
Before scheduling a job, the scheduler needs to know how many resources it requires. Good estimation averages historical data for the same job type. B...
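The averaging idea can be sketched as a small estimator that records past usage per job type and returns the mean. The class name, the CPU and memory fields, and the fallback defaults for job types with no history are all assumptions for this example.

```python
from collections import defaultdict
from statistics import mean


class ResourceEstimator:
    """Estimates resource needs as the average of past runs of the same job type."""

    def __init__(self, default_cpu=1.0, default_mem_mb=512.0):
        self._history = defaultdict(list)              # job_type -> [(cpu, mem_mb), ...]
        self._default = (default_cpu, default_mem_mb)  # assumed fallback for unseen types

    def record(self, job_type, cpu, mem_mb):
        """Store the observed usage of one finished run."""
        self._history[job_type].append((cpu, mem_mb))

    def estimate(self, job_type):
        """Return (cpu, mem_mb): the historical mean, or the default if no history."""
        runs = self._history.get(job_type)
        if not runs:
            return self._default
        return (mean(r[0] for r in runs), mean(r[1] for r in runs))
```

After recording runs of `(2.0, 1000.0)` and `(4.0, 3000.0)` for one job type, `estimate` returns `(3.0, 2000.0)` for that type and the defaults for any other.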
Distributed Work Allocation
Implement Work Stealing Scheduler
A central queue becomes a bottleneck when hundreds of workers hammer it simultaneously. Work stealing eliminates it: each worker has its own local deq...
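A single-threaded sketch of the idea follows: each worker pops from one end of its own deque and, when empty, steals from the opposite end of a peer's deque. The `Worker` class is hypothetical, and the sketch is not thread-safe; a real implementation would need locks or a lock-free deque.

```python
import random
from collections import deque


class Worker:
    """Worker with a local deque; steals from peers when its own deque is empty."""

    def __init__(self, name):
        self.name = name
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)

    def next_task(self, peers, rng=random):
        """Return the next task to run, or None if no work exists anywhere."""
        if self.tasks:
            return self.tasks.pop()                  # own work: newest first
        victims = [p for p in peers if p is not self and p.tasks]
        if not victims:
            return None
        return rng.choice(victims).tasks.popleft()   # steal the oldest task
```

Taking from opposite ends keeps the owner and the thief away from each other's end of the deque, which is what lets real implementations keep contention low.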
Implement MapReduce-Style Work Partitioning
Large datasets are split into partitions so multiple workers can process them simultaneously. The partitioning strategy controls how evenly work is di...
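Hash partitioning is one strategy that tends to spread keys evenly; the sketch below assumes it, without implying it is the strategy this task prescribes. It uses `hashlib` rather than the built-in `hash()` because string hashes from `hash()` vary between Python processes. The function names are illustrative.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a string key to a partition index that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_records(records, num_partitions, key_fn):
    """Split records into num_partitions lists; equal keys land in the same list."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[partition_for(key_fn(record), num_partitions)].append(record)
    return partitions
```

Because every record with the same key lands in the same partition, a later reduce step can process each key on a single worker.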
Implement Fault-Tolerant Scheduler
A scheduler that crashes loses all in-flight job assignments. A fault-tolerant scheduler writes every decision to a write-ahead log (WAL) before acting, so it can repla...
Implement Distributed Job Queue
A single-broker job queue is both a bottleneck and a single point of failure. A distributed queue partitions jobs across multiple brokers and replicat...
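The sketch below illustrates one way to combine partitioning and replication: a job's primary broker is chosen by hash, and copies go to the next brokers in list order. The in-memory `DistributedQueue`, the `replica_set` helper, and the replication factor are hypothetical; real brokers would be separate processes with their own failure handling.

```python
import hashlib


def replica_set(job_id, brokers, replication_factor=2):
    """Pick the primary broker by hash, then the following brokers as replicas."""
    if not 1 <= replication_factor <= len(brokers):
        raise ValueError("replication_factor must be between 1 and len(brokers)")
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(brokers)
    return [brokers[(start + i) % len(brokers)] for i in range(replication_factor)]


class DistributedQueue:
    """In-memory stand-in for a set of brokers, each storing its share of jobs."""

    def __init__(self, brokers, replication_factor=2):
        self.brokers = list(brokers)
        self.replication_factor = replication_factor
        self.store = {b: [] for b in self.brokers}

    def enqueue(self, job_id):
        """Write the job to every broker in its replica set; return that set."""
        owners = replica_set(job_id, self.brokers, self.replication_factor)
        for broker in owners:
            self.store[broker].append(job_id)
        return owners

    def jobs_surviving(self, failed_broker):
        """Jobs still held by at least one broker other than the failed one."""
        return {j for b, jobs in self.store.items() if b != failed_broker for j in jobs}
```

With a replication factor of 2, losing any single broker leaves every job available on at least one other broker.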
Implement Dynamic Scheduling with Locality Awareness
Moving a job to where its data lives is cheaper than shipping large data over the network. Locality-aware scheduling scores workers based on data prox...
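A minimal scoring sketch follows. The three proximity tiers (same node, same rack, elsewhere), the dictionary fields, and the load-based tie-break are assumptions chosen for illustration rather than the scoring scheme this task specifies.

```python
def locality_score(worker, data_location):
    """Score a worker by how close it is to the data: 2 = same node, 1 = same rack, 0 = remote."""
    if worker["node"] == data_location["node"]:
        return 2
    if worker["rack"] == data_location["rack"]:
        return 1
    return 0


def pick_worker(workers, data_location):
    """Choose the worker with the best locality; among equals, prefer the least loaded."""
    return max(
        workers,
        key=lambda w: (locality_score(w, data_location), -w["load"]),
    )
```

A busy worker on the data's own node still wins over an idle remote worker under this scoring; weighting load more heavily is a design choice a real scheduler would need to tune.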
Interview Prep
Common interview questions for Platform / Infrastructure Engineer roles that map directly to what you build in this track.
Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.
Common Mistakes
The top 5 mistakes builders make in this track, with the root cause and the correct approach for each.
Comparison Mode
Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track.
Concepts Covered
Prerequisites
Complete the previous tracks before starting this one; concepts build progressively throughout the curriculum.
Rabbit Holes
For when you want to go deeper. Curated papers, posts, and talks beyond what this track covers.
Large-scale cluster management at Google with Borg
The paper on Borg, the Google cluster manager that Kubernetes was modeled after. Describes how Google schedules millions of tasks across tens of thousands of machines, including priority tiers, admission control, and the Borglet agent design.
Kubernetes Scheduler Architecture
Kubernetes' scheduler documentation covers filtering (which nodes are eligible) and scoring (which eligible node is best), the plugin framework, and scheduling policies. A practical guide to a production scheduler used at massive scale.
Apache Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
The Mesos paper introduces the two-level scheduling model: Mesos offers resources to frameworks (such as Spark, Hadoop, and Marathon), which then make their own scheduling decisions. It is a widely cited alternative to centralized scheduling.