Post-Mortems

Real production incidents, analysed with actual root causes, failure diagrams, and the specific lessons that apply when you are building distributed systems yourself.

No hypotheticals. No vendor spin. Just what broke and why.

SEV1 · GitHub Packages / Container Registry · 2h 08m · 9 min read

The CDN Config Push That Made Docker Pulls Fail for Two Hours

A configuration push to GitHub's CDN layer introduced an incorrect cache-control directive that caused container image layer requests to return stale 404 responses, breaking docker pull for any image modified in the preceding 72 hours.

April 28, 2026

github-packages · cdn · docker

SEV1 · GitHub Actions · 1h 32m · 8 min read

How a Runner Pool Miscalculation Stalled GitHub Actions for 90 Minutes

A capacity planning miscalculation in GitHub's hosted runner fleet caused 78% of queued workflow jobs to wait indefinitely during peak hours, with the queue growing faster than the autoscaling system could provision new runners.

April 22, 2026

github-actions · autoscaling · capacity-planning
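
The failure mode in the runner pool summary above, a queue that grows faster than autoscaling can add capacity, can be illustrated with a toy model. This is a minimal sketch, not GitHub's scheduler: the function name `simulate_queue`, its parameters, and every number used with it are hypothetical and chosen only to show the arithmetic.

```python
def simulate_queue(arrivals_per_min, start_runners, runners_added_per_min,
                   jobs_per_runner_per_min, minutes):
    """Toy model of a job queue served by a slowly growing runner pool.

    All parameters are hypothetical; none come from the incident itself.
    Returns the queue depth at the end of each simulated minute.
    """
    queue = 0
    runners = start_runners
    history = []
    for _ in range(minutes):
        # Jobs the current pool can finish during this minute.
        capacity = runners * jobs_per_runner_per_min
        # Backlog grows whenever arrivals exceed capacity; it never goes negative.
        queue = max(0, queue + arrivals_per_min - capacity)
        history.append(queue)
        # Autoscaling adds a fixed number of runners per minute.
        runners += runners_added_per_min
    return history


if __name__ == "__main__":
    # 100 jobs/min arriving, 10 runners to start, 1 new runner per minute.
    print(simulate_queue(100, 10, 1, 1, 3))
```

In this model the backlog keeps growing until the pool's capacity catches up with the arrival rate, and only starts to drain after that point. That is why the provisioning rate matters as much as the eventual pool size.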

SEV1 · GitHub Actions · 2h 55m · 7 min read

The Redis Load Balancer That Silently Broke CI for Three Hours

A misconfigured load balancer in GitHub's Redis infrastructure caused 95% of Actions workflow runs to silently fail at the dispatch layer, with no error reaching the engineer who triggered them.

March 5, 2026

redis · load-balancer · github-actions

SEV1 · GitHub.com / GitHub API · 1h 10m · 6 min read

Cache Expiration, Replica Lag, and How 40% of GitHub Went Dark

A caching bug introduced in a routine deployment caused cache recalculation to read from a lagging read replica. The result was a cascade of stale cache entries that returned incorrect responses for 40% of github.com requests and 43% of API calls.

March 3, 2026

caching · replication-lag · read-replica
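
The caching summary above describes cache recalculation reading from a replica that had fallen behind. One common guard against that class of bug is to check replica lag before trusting the replica for a recompute. The sketch below assumes a hypothetical `Replica` type and `recompute_cache_value` helper. It illustrates the general technique and is not the fix GitHub applied.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical view of a read replica: its current value and its lag."""
    value: str
    lag_seconds: float


def recompute_cache_value(primary_value, replica, max_lag_seconds=1.0):
    """Pick the source for a cache recompute.

    Uses the replica only when its reported lag is within the threshold
    (an assumed, illustrative default of 1 second); otherwise falls back
    to the primary so stale data is not written into the cache.
    Returns a (value, source) tuple.
    """
    if replica.lag_seconds <= max_lag_seconds:
        return replica.value, "replica"
    return primary_value, "primary"


if __name__ == "__main__":
    # A replica 30 seconds behind is skipped in favour of the primary.
    print(recompute_cache_value("new", Replica("old", 30.0)))
```

Falling back to the primary trades extra primary load for correctness, so a real system would also need to bound how many recomputes can hit the primary at once.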