Subtracks & Tasks
Distributed Tracing
Implement Distributed Trace Context Propagation
Distributed tracing links spans from a single request as it flows through multiple services. Each service propagates the trace context (trace ID + par...
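As a rough illustration of propagation, the sketch below writes a trace ID and parent span ID into outgoing request headers and reads them back on the receiving side. It assumes the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`), which the task description does not name. The helper names `inject`, `extract`, and `new_span_id` are hypothetical, and the validation is simplified.

```python
import re
import secrets

# Assumed layout (W3C Trace Context): 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Return a random 8-byte span ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Read trace context from incoming headers.

    Returns (trace_id, parent_span_id, sampled), or None when the header
    is missing or malformed, in which case the caller would start a new trace.
    """
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    return trace_id, parent_id, (int(flags, 16) & 1) == 1
```

A downstream service would call `extract` on the incoming request, create its own span with `new_span_id()`, and call `inject` with the same trace ID and its new span ID before making further calls.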
Implement Span Lifecycle Management
A span represents one operation within a trace: it has a name, start and end timestamps, status, and optionally events and links. The span kind identi...
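A minimal sketch of that lifecycle, assuming a plain dataclass rather than any particular tracing library. The field names and the status strings (`UNSET`, `OK`, `ERROR`) are illustrative choices, not something the task description prescribes.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    kind: str = "INTERNAL"          # hypothetical default span kind
    status: str = "UNSET"
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None
    events: list = field(default_factory=list)

    def add_event(self, name: str) -> None:
        """Record a timestamped event; ignored once the span has ended."""
        if self.end_ns is None:
            self.events.append((time.time_ns(), name))

    def end(self, status: str = "OK") -> None:
        """Close the span once; later calls leave it unchanged."""
        if self.end_ns is None:
            self.status = status
            self.end_ns = time.time_ns()

    def __enter__(self) -> "Span":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Mark the span as failed if the wrapped block raised.
        self.end("ERROR" if exc_type is not None else "OK")
        return False  # never swallow the exception
```

Using the span as a context manager (`with Span("db.query", kind="CLIENT") as span:`) guarantees that it is ended even when the wrapped operation raises.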
Implement Distributed Trace Collector
A trace collector receives spans from many services, groups them by trace ID, and stores the assembled traces. It also applies sampling to reduce stor...
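One way to picture the collector is an in-memory map from trace ID to spans, guarded by a sampling decision. The sketch below assumes spans arrive as dictionaries with a hex `trace_id` key and uses a deterministic sampler derived from the trace ID, so every span of a trace gets the same keep-or-drop decision. The class and field names are hypothetical.

```python
from collections import defaultdict


class Collector:
    def __init__(self, sample_rate: float = 1.0):
        self.sample_rate = sample_rate
        self.traces = defaultdict(list)  # trace_id -> list of spans

    def _keep(self, trace_id: str) -> bool:
        # Map the first 32 bits of the trace ID to a value in [0, 1).
        # The same trace ID always yields the same decision.
        bucket = int(trace_id[:8], 16) / 0x100000000
        return bucket < self.sample_rate

    def receive(self, span: dict) -> bool:
        """Store the span under its trace ID; return False if sampled out."""
        trace_id = span["trace_id"]
        if not self._keep(trace_id):
            return False
        self.traces[trace_id].append(span)
        return True
```

Deciding from the trace ID instead of a random draw per span avoids storing partial traces, since spans from different services agree on the outcome without coordination.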
Implement Trace Analysis and Insights
Raw traces tell you what happened. Trace analysis tells you why it was slow and where errors are concentrated. By aggregating many traces, you surface...
Implement End-to-End Distributed Tracing System
Individual span operations give you the building blocks. End-to-end tracing connects them: every service creates and propagates spans, logs are correl...
Metrics and Alerting
Implement Metrics Collection
Metrics quantify system behavior: how many requests, how fast, how much memory. Three types cover nearly everything: counters (monotonically increasin...
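To make the counter type concrete, here is a small sketch of a labeled counter that enforces the "only increases" property. The class name, the label handling, and the metric name in the usage note are assumptions for illustration, not an API from any specific metrics library.

```python
class Counter:
    def __init__(self, name: str):
        self.name = name
        self._values = {}  # sorted label tuple -> running total

    def inc(self, amount: float = 1.0, **labels) -> None:
        """Add a non-negative amount to the series selected by labels."""
        if amount < 0:
            raise ValueError("counters only increase")
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0.0) + amount

    def value(self, **labels) -> float:
        """Return the current total for a label set, or 0.0 if never incremented."""
        return self._values.get(tuple(sorted(labels.items())), 0.0)
```

For example, `Counter("http_requests_total").inc(route="/home")` increments the series for that route only, and rates are then derived by comparing the total at two points in time.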
Implement Alerting Rules Engine
An alerting rules engine evaluates metric conditions and fires notifications when thresholds are breached. It routes alerts to the right channel based...
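A sketch of the evaluate-then-route loop, under the assumption that each rule compares one metric against a fixed threshold and that routing is keyed by severity. The `Rule` fields, the `ROUTES` table, and the channel names are hypothetical.

```python
import operator
from dataclasses import dataclass

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

# Hypothetical routing table: severity -> notification channel.
ROUTES = {"critical": "pager", "warning": "chat"}


@dataclass(frozen=True)
class Rule:
    name: str
    metric: str
    op: str
    threshold: float
    severity: str


def evaluate(rules, metrics):
    """Return (channel, rule_name, value) for every rule whose condition holds."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        # A missing metric does not fire; a real engine might alert on absence.
        if value is not None and OPS[rule.op](value, rule.threshold):
            channel = ROUTES.get(rule.severity, "email")
            fired.append((channel, rule.name, value))
    return fired
```

Calling `evaluate` on each scrape interval with the latest metric snapshot yields the alerts to send; deduplication and "for N minutes" hold-down logic are left out of this sketch.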
Implement Metrics Aggregation and Rollups
Individual data points are too granular for dashboards and alerting. Aggregation reduces them to meaningful summaries: totals across services, average...
Implement Monitoring Dashboards and Visualization
Dashboards aggregate multiple metric queries into a single health view. Each panel runs a different query (request rate, error rate, latency) and visu...
Implement Alert Integrations and On-Call Management
Alert integrations route notifications to the right people and tools. Critical incidents trigger PagerDuty to page the on-call engineer. Non-critical ...
Interview Prep
Common interview questions for Platform / Observability Engineer roles that map directly to what you build in this track.
Questions are representative of real interview patterns. Model answers are starting points; adapt them with your own experience and the specific context of the interview.
Common Mistakes
The top 5 mistakes builders make in this track, with the root cause and the correct approach for each.
Comparison Mode
Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track.
Concepts Covered
Prerequisites
Complete the previous tracks before starting this one; concepts build progressively throughout the curriculum.
Rabbit Holes
For when you want to go deeper. Curated papers, posts, and talks beyond what this track covers.
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
Google's 2010 paper introducing Dapper — the internal distributed tracing system that inspired Zipkin, Jaeger, and OpenTelemetry's data model. Covers trace context propagation, sampling strategies, and overhead management techniques.
OpenTelemetry Concepts: Traces
The canonical reference for the OpenTelemetry trace data model — spans, trace context, attributes, events, and links. OpenTelemetry is now the vendor-neutral standard for distributed tracing instrumentation.
Tail-Based Sampling in Distributed Tracing
OpenTelemetry's documentation on tail-based sampling — the technique for keeping 100% of traces from slow or failing requests while sampling out fast, successful ones. Understanding this is essential for operating a production tracing system cost-effectively.
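The decision rule described above can be sketched as a function that runs once all spans of a trace have arrived. This is an illustration, not OpenTelemetry's implementation: the span dictionary keys, the 500 ms slow threshold, and the 1% baseline rate are all assumed values.

```python
def keep_trace(spans, slow_ms=500.0, baseline_rate=0.01):
    """Decide whether to store a completed trace.

    Keeps every trace that contains an error or exceeds slow_ms end to end,
    and a deterministic fraction (baseline_rate) of the remaining traces.
    """
    if any(s["status"] == "ERROR" for s in spans):
        return True
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    if end - start >= slow_ms:
        return True
    # Fast and successful: keep a small, trace-ID-derived sample.
    bucket = int(spans[0]["trace_id"][:8], 16) / 0x100000000
    return bucket < baseline_rate
```

The cost of this approach is that the collector must buffer every span until the trace is complete, which is why it is usually paired with a timeout for traces that never finish.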