Post-Mortems
Real production incidents, analysed with actual root causes, failure diagrams, and the specific lessons that apply when you are building distributed systems yourself.
No hypotheticals. No vendor spin. Just what broke and why.
The CDN Config Push That Made Docker Pulls Fail for Two Hours
A configuration push to GitHub's CDN layer introduced an incorrect cache-control directive that caused container image layer requests to return stale 404 responses, breaking docker pull for any image modified in the preceding 72 hours.
How a Runner Pool Miscalculation Stalled GitHub Actions for 90 Minutes
A capacity planning miscalculation in GitHub's hosted runner fleet caused 78% of queued workflow jobs to wait indefinitely during peak hours, with the queue growing faster than the autoscaling system could provision new runners.
The Redis Load Balancer That Silently Broke CI for Three Hours
A misconfigured load balancer in GitHub's Redis infrastructure caused 95% of Actions workflow runs to silently fail at the dispatch layer, with no error reaching the engineer who triggered them.
Cache Expiration, Replica Lag, and How 40% of GitHub Went Dark
A caching bug introduced in a routine deployment caused cache recalculation to read from a lagging read replica. The result was a cascade of stale cached data serving incorrect responses to 40% of github.com requests and 43% of API calls.