The Scheduler

Advanced

Operations | 10 tasks

How does your system know which node should run which job, and what happens when that node dies? Build job queues, priority scheduling, cron systems, and distributed work allocation from scratch.

Subtracks & Tasks

Centralized Job Scheduling

CE-1

intermediate

Implement Centralized Job Scheduler

A centralized scheduler is the single authority that receives all job submissions, maintains a priority queue, and dispatches work to available worker...

priority queue, worker assignment, job dispatch, +2 more
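To make the idea of a single authority with a priority queue concrete, here is a minimal in-memory sketch. The class name `CentralScheduler`, its methods, and the "lower number means more urgent" convention are all assumptions made for illustration; they are not taken from the task itself.

```python
import heapq
import itertools


class CentralScheduler:
    """Hypothetical single-authority scheduler: one priority queue, one idle pool."""

    def __init__(self, workers):
        self._heap = []                 # entries are (priority, seq, job_id)
        self._seq = itertools.count()   # tie-breaker: FIFO within the same priority
        self._idle = list(workers)      # workers ready to accept a job

    def submit(self, job_id, priority):
        # Assumption of this sketch: a lower number means a more urgent job.
        heapq.heappush(self._heap, (priority, next(self._seq), job_id))

    def dispatch(self):
        """Pair queued jobs with idle workers until one side runs out."""
        assignments = []
        while self._heap and self._idle:
            _, _, job_id = heapq.heappop(self._heap)
            assignments.append((job_id, self._idle.pop(0)))
        return assignments

    def release(self, worker):
        """Return a worker to the idle pool once its job is done."""
        self._idle.append(worker)
```

A real scheduler would also track in-flight jobs and handle worker failures; this sketch covers only submission and dispatch.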

CE-2

advanced

Implement Deadlock Prevention in Scheduling

Deadlock happens when two jobs each hold a resource the other needs, so neither can proceed. Prevention is better than detection: refuse any allocatio...

Banker's algorithm, safe state, wait-for graph, +2 more
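The prevention idea can be sketched with a Banker's-style safety check: before granting a request, simulate the grant and confirm that every job could still finish in some order. The function names and the list-of-lists layout (one row per job, one column per resource type) are assumptions chosen for this illustration.

```python
def is_safe(available, allocation, need):
    """Return True if some ordering lets every job run to completion."""
    work = list(available)
    finished = [False] * len(allocation)
    progressed = True
    while progressed:
        progressed = False
        for i, done in enumerate(finished):
            if not done and all(n <= w for n, w in zip(need[i], work)):
                # Job i can finish and hand back everything it holds.
                work = [w + a for w, a in zip(work, allocation[i])]
                finished[i] = True
                progressed = True
    return all(finished)


def try_allocate(job, request, available, allocation, need):
    """Return True only if granting `request` leaves the system in a safe state."""
    exceeds_need = any(r > n for r, n in zip(request, need[job]))
    exceeds_free = any(r > a for r, a in zip(request, available))
    if exceeds_need or exceeds_free:
        return False
    # Simulate the grant on copies so the caller's state is untouched.
    new_available = [a - r for a, r in zip(available, request)]
    new_allocation = [row[:] for row in allocation]
    new_need = [row[:] for row in need]
    new_allocation[job] = [x + r for x, r in zip(allocation[job], request)]
    new_need[job] = [x - r for x, r in zip(need[job], request)]
    return is_safe(new_available, new_allocation, new_need)
```

With one resource type, 3 free units, allocations of 5, 2, and 2, and remaining needs of 5, 2, and 7, granting one more unit to the third job would be refused, because no job could then be guaranteed to finish after the second one completes.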

CE-3

advanced

Implement Fair Job Scheduling

Pure priority scheduling causes starvation: low-priority jobs may wait forever if high-priority jobs keep arriving. Fair scheduling prevents this thro...

starvation prevention, aging, MLFQ, +3 more
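One common way to prevent starvation is aging: a job's effective priority improves the longer it waits. The sketch below assumes lower numbers are more urgent and uses a hypothetical linear `aging_rate`; the job dictionary fields are likewise illustrative.

```python
def pick_next(jobs, now, aging_rate=1.0):
    """Pick the job with the best effective priority.

    jobs: list of dicts with 'id', 'priority' (lower = more urgent),
    and 'enqueued_at' (same time unit as `now`).
    """
    def effective(job):
        waited = now - job["enqueued_at"]
        # Each unit of waiting improves the priority by `aging_rate`.
        return job["priority"] - aging_rate * waited

    return min(jobs, key=effective)["id"]
```

With this rule, a low-priority job that has waited long enough eventually outranks a freshly submitted high-priority job, so it cannot wait forever.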

CE-4

advanced

Implement Dependency-Aware Job Scheduling

Some jobs can only start after others finish. Dependency-aware scheduling builds an execution plan that respects these constraints while maximising pa...

topological sort, critical path, circular dependency, +2 more
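A dependency-respecting plan can be sketched as a layered topological sort: each round contains every job whose prerequisites have already finished, so the jobs within a round can run in parallel. The function name `plan_rounds` is hypothetical, and the sketch assumes every dependency also appears as a key in the input mapping.

```python
def plan_rounds(deps):
    """deps maps job -> set of jobs it must wait for. Returns a list of rounds."""
    remaining = {job: set(d) for job, d in deps.items()}
    rounds = []
    while remaining:
        # Jobs with no unfinished prerequisites are ready to run now.
        ready = sorted(job for job, d in remaining.items() if not d)
        if not ready:
            # Nothing is runnable but jobs remain: they must form a cycle.
            raise ValueError(f"circular dependency among {sorted(remaining)}")
        rounds.append(ready)
        for job in ready:
            del remaining[job]
        for d in remaining.values():
            d.difference_update(ready)
    return rounds
```

For a diamond-shaped graph where `b` and `c` depend on `a`, and `d` depends on both, this yields three rounds: `a`, then `b` and `c` together, then `d`.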

CE-5

advanced

Implement Resource Estimation and Provisioning

Before scheduling a job, the scheduler needs to know how many resources it requires. Good estimation averages historical data for the same job type. B...

resource estimation, bin packing, auto-scaling, +2 more
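As a rough illustration of estimating from history and then packing jobs onto nodes, the sketch below averages past usage per job type and places jobs with first-fit decreasing. First-fit decreasing is one well-known bin-packing heuristic chosen here as an assumption; the function names and the single abstract resource unit are also illustrative.

```python
from statistics import mean


def estimate(history, job_type, default):
    """Average past usage for this job type, or fall back to a default."""
    samples = history.get(job_type)
    return mean(samples) if samples else default


def first_fit_decreasing(demands, capacity):
    """Pack jobs onto as few nodes as the heuristic manages.

    demands: dict of job -> required units. Returns a list of nodes,
    each a list of the jobs placed on it.
    """
    bins = []  # each entry is [remaining_capacity, [jobs]]
    for job, size in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        if size > capacity:
            raise ValueError(f"{job} needs {size}, more than one node holds")
        for b in bins:
            if b[0] >= size:
                b[0] -= size
                b[1].append(job)
                break
        else:
            # No existing node has room, so provision a new one.
            bins.append([capacity - size, [job]])
    return [jobs for _, jobs in bins]
```

The number of nodes returned gives a simple provisioning signal: if it exceeds the current pool size, more capacity is needed.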

Distributed Work Allocation

DI-1

advanced

Implement Work Stealing Scheduler

A central queue becomes a bottleneck when hundreds of workers hammer it simultaneously. Work stealing eliminates it: each worker has its own local deq...

work stealing, deque, LIFO stealing, +2 more
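The per-worker deque idea can be sketched in a single thread as follows. This follows the commonly described convention in which the owner takes its newest task and a thief takes the victim's oldest task from the opposite end; the track's own variant may differ. The `Worker` class and its methods are hypothetical, and a concurrent version would need locking or a lock-free deque.

```python
from collections import deque
import random


class Worker:
    """Hypothetical worker with a local deque (single-threaded sketch)."""

    def __init__(self, name):
        self.name = name
        self.local = deque()

    def push(self, task):
        self.local.append(task)

    def next_task(self, peers, rng=random):
        if self.local:
            # The owner works on its most recently pushed task.
            return self.local.pop()
        victims = [p for p in peers if p is not self and p.local]
        if not victims:
            return None  # every queue is empty: this worker is idle
        # The thief takes the oldest task from the other end of a victim's deque.
        return rng.choice(victims).local.popleft()
```

Because the owner and the thief operate on opposite ends, they contend only when a deque is nearly empty, which is what removes the central bottleneck.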

DI-2

advanced

Implement MapReduce-Style Work Partitioning

Large datasets are split into partitions so multiple workers can process them simultaneously. The partitioning strategy controls how evenly work is di...

hash partitioning, range partitioning, data skew, +2 more
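To show how the partitioning strategy affects balance, here is a small sketch of hash partitioning, range partitioning, and a simple skew measure. The function names and the skew definition (largest partition divided by the mean partition size) are assumptions for illustration. A stable digest is used because Python's built-in `hash()` for strings varies between processes.

```python
import bisect
import hashlib
from collections import Counter


def hash_partition(key, num_partitions):
    """Map a string key to a partition using a process-independent hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def range_partition(key, boundaries):
    """boundaries is a sorted list of split points.

    Keys below boundaries[0] go to partition 0, and keys at or above the
    last boundary go to the final partition.
    """
    return bisect.bisect_right(boundaries, key)


def skew(partition_ids, num_partitions):
    """Largest partition size divided by the mean size (1.0 = perfectly even)."""
    counts = Counter(partition_ids)
    return max(counts.values()) / (len(partition_ids) / num_partitions)
```

A skew well above 1.0 means one worker receives far more data than the others and is likely to become the straggler.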

DI-3

advanced

Implement Fault-Tolerant Scheduler

A scheduler that crashes loses all in-flight job assignments. A fault-tolerant scheduler writes every decision to a WAL before acting, so it can repla...

WAL, crash recovery, leader election, +2 more
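The write-ahead idea can be sketched as: append the decision to a durable log, force it to disk, and only then update in-memory state; on restart, replay the log to rebuild that state. The class `WalScheduler`, the JSON-lines record format, and the `assign`/`complete` operations are assumptions for illustration. The sketch does not handle a torn final record left by a crash in the middle of a write.

```python
import json
import os


class WalScheduler:
    """Hypothetical scheduler state backed by a JSON-lines write-ahead log."""

    def __init__(self, path):
        self.path = path
        self.assignments = {}  # job_id -> worker for in-flight jobs
        self._replay()
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild in-memory state from the log, if one exists."""
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    self._apply(json.loads(line))

    def _apply(self, rec):
        if rec["op"] == "assign":
            self.assignments[rec["job"]] = rec["worker"]
        elif rec["op"] == "complete":
            self.assignments.pop(rec["job"], None)

    def _append(self, rec):
        # Log first and force it to disk, then act on the decision.
        self._log.write(json.dumps(rec) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self._apply(rec)

    def assign(self, job, worker):
        self._append({"op": "assign", "job": job, "worker": worker})

    def complete(self, job):
        self._append({"op": "complete", "job": job})

    def close(self):
        self._log.close()
```

Constructing a new `WalScheduler` on the same path after a crash restores the in-flight assignments that had been logged.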

DI-4

advanced

Implement Distributed Job Queue

A single-broker job queue is both a bottleneck and a single point of failure. A distributed queue partitions jobs across multiple brokers and replicat...

partitioned queue, replication, consumer assignment, +2 more
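One simple way to picture partitioning plus replication is a ring placement: each partition has a leader broker, and its copies live on the next brokers in ring order. This placement rule, the function names, and the "first surviving replica becomes leader" failover policy are all assumptions for illustration, not the design the task prescribes.

```python
def replicas_for(partition, brokers, replication_factor):
    """Return the brokers holding this partition, leader first."""
    n = len(brokers)
    if replication_factor > n:
        raise ValueError("replication factor exceeds broker count")
    # The leader is brokers[partition % n]; followers are its ring successors.
    return [brokers[(partition + i) % n] for i in range(replication_factor)]


def failover(replicas, dead):
    """Pick the first surviving replica as the new leader, or None if all are gone."""
    alive = [b for b in replicas if b not in dead]
    return alive[0] if alive else None
```

With three brokers and a replication factor of 2, losing any single broker still leaves one live copy of every partition.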

DI-5

advanced

Implement Dynamic Scheduling with Locality Awareness

Moving a job to where its data lives is cheaper than shipping large data over the network. Locality-aware scheduling scores workers based on data prox...

data locality, rack awareness, worker scoring, +2 more
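Worker scoring can be sketched as a locality bonus minus a load penalty. The specific weights (1.0 for the same node, 0.5 for the same rack, 0.0 otherwise), the `load_weight` parameter, and the worker dictionary fields are hypothetical values chosen to illustrate the trade-off.

```python
def score_worker(worker, data_node, data_rack, load_weight=0.5):
    """Higher is better. worker has 'node', 'rack', and 'load' in [0, 1]."""
    if worker["node"] == data_node:
        locality = 1.0   # data is on the same machine
    elif worker["rack"] == data_rack:
        locality = 0.5   # data is one switch hop away
    else:
        locality = 0.0   # data must cross racks
    return locality - load_weight * worker["load"]


def pick_worker(workers, data_node, data_rack, load_weight=0.5):
    """Return the name of the best-scoring worker."""
    best = max(
        workers,
        key=lambda w: score_worker(w, data_node, data_rack, load_weight),
    )
    return best["name"]
```

Raising `load_weight` shifts the choice toward idle workers even when they are farther from the data, which is the load versus locality trade-off in miniature.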

Interview Prep

Common interview questions for Platform / Infrastructure Engineer roles that map directly to what you build in this track.

Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.

Common Mistakes

The top 5 mistakes builders make in this track, and exactly how to fix them.

Comparison Mode

Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track.

Concepts Covered

priority queue, worker assignment, job dispatch, failure handling, queue status, Banker's algorithm, safe state, wait-for graph, preemption, cycle detection, starvation prevention, aging, MLFQ, time quantum, I/O-bound promotion, fair share, topological sort, critical path, circular dependency, failure propagation, parallel rounds, resource estimation, bin packing, auto-scaling, historical analysis, packing efficiency, work stealing, deque, LIFO stealing, lock-free scheduling, idle detection, hash partitioning, range partitioning, data skew, straggler mitigation, speculative execution, WAL, crash recovery, leader election, generation numbers, duplicate prevention, partitioned queue, replication, consumer assignment, partition rebalancing, broker failover, data locality, rack awareness, worker scoring, dynamic data placement, load vs locality tradeoff

Prerequisites

It is recommended to complete the previous tracks before starting this one. Concepts build progressively throughout the curriculum.

Rabbit Holes

For when you want to go deeper. Curated papers, posts, and talks beyond what this track covers.

Paper

Large-scale cluster management at Google with Borg

Google's Borg paper, describing the cluster manager that directly influenced the design of Kubernetes. It covers how Borg runs hundreds of thousands of jobs across clusters of up to tens of thousands of machines, including priority tiers, admission control, and the Borglet agent design.

Blog

Kubernetes Scheduler Architecture

Kubernetes' scheduler documentation covers filtering (which nodes are eligible) and scoring (which eligible node is best), the plugin framework, and scheduling policies. A practical guide to a production scheduler used at massive scale.

Paper

Apache Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

The Mesos paper, which introduces the two-level scheduling model: Mesos offers resources to frameworks (Spark, Hadoop, Marathon), which then make their own scheduling decisions. It is a widely cited alternative to monolithic centralized schedulers.