The Redis Load Balancer That Silently Broke CI for Three Hours
A misconfigured load balancer in GitHub's Redis infrastructure caused 95% of Actions workflow runs to silently fail at the dispatch layer, with no error reaching the engineer who triggered them.
On March 5, 2026, GitHub's engineering team made a routine change to their Redis infrastructure. Routine is the word you use when you expect nothing to go wrong. For the next two hours and fifty-five minutes, 95% of GitHub Actions workflow runs silently failed to start.
No error reached the engineer who triggered them. That is the thing worth sitting with. You push a commit, the Actions tab shows the workflow queued, and then nothing runs. The trigger API returned success. The job was enqueued. Everything looked fine at the surface. Inside, the dispatch machinery had quietly stopped working.
What GitHub Actions Is Actually Doing When You Push
GitHub Actions uses a distributed job queue backed by Redis. When a workflow triggers, the event flows through a dispatch service that serializes the job definition and pushes it onto a Redis-backed queue. A pool of worker processes runs continuously, polling that queue for available work. When a worker claims a job, it routes the job to the appropriate runner infrastructure for execution.
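The flow described above can be sketched in a few lines. This is a minimal illustration, not GitHub's implementation: `FakeRedis` is an in-memory stand-in that mimics the list semantics of the Redis `LPUSH` and `RPOP` commands, and the key name `actions:jobs` and the job fields are hypothetical.

```python
import json
from collections import deque


class FakeRedis:
    """In-memory stand-in for one Redis backend holding named lists."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        # Push onto the head of the list and return the new length.
        self.lists.setdefault(key, deque()).appendleft(value)
        return len(self.lists[key])

    def rpop(self, key):
        # Pop from the tail; an empty or missing list yields None, not an error.
        q = self.lists.get(key)
        return q.pop() if q else None


QUEUE_KEY = "actions:jobs"  # hypothetical key name


def dispatch(redis, job):
    """Serialize a job definition and push it onto the queue."""
    return redis.lpush(QUEUE_KEY, json.dumps(job))


def poll_once(redis):
    """One worker poll: claim the oldest job, or None if nothing is waiting."""
    raw = redis.rpop(QUEUE_KEY)
    return json.loads(raw) if raw is not None else None


backend = FakeRedis()
dispatch(backend, {"workflow": "ci.yml", "sha": "abc123"})
dispatch(backend, {"workflow": "release.yml", "sha": "def456"})
first = poll_once(backend)  # the oldest job is claimed first
```

Note what `poll_once` returns when the queue is empty: `None`, not an exception. That detail matters for everything that follows.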
A load balancer sits in front of the Redis cluster. This is normal at GitHub's scale: you need health checking, connection pooling, and routing logic across multiple Redis primaries. The LB maps incoming worker connections to available Redis backends.
Where It Broke
During the March 5 infrastructure update, load balancer settings were misconfigured. Workers connected to the LB successfully (the TCP handshake completed), but were routed to the wrong Redis backend. The queue accepted new jobs on the enqueue side. On the dequeue side, workers could not reach the correct nodes to claim work.
The result was a queue with a working entrance and a blocked exit. Jobs piled up on the enqueue side with almost no consumer able to claim them.
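To see why nothing errored, here is a toy reproduction of the failure shape. It assumes a simplified model in which a routing table maps each side of the queue to a backend; the node names `redis-a` and `redis-b` and both routing tables are invented for illustration and say nothing about GitHub's actual topology.

```python
from collections import deque

# Two hypothetical Redis backends, each holding one job list.
backends = {"redis-a": deque(), "redis-b": deque()}

# Hypothetical routing tables: the enqueue path is correct, while the
# dequeue path has been pointed at the wrong backend by a bad change.
ENQUEUE_ROUTE = {"actions:jobs": "redis-a"}
DEQUEUE_ROUTE = {"actions:jobs": "redis-b"}


def enqueue(key, job):
    backends[ENQUEUE_ROUTE[key]].appendleft(job)
    return "ok"  # the caller sees success


def dequeue(key):
    q = backends[DEQUEUE_ROUTE[key]]
    return q.pop() if q else None  # an empty queue is not an error


statuses = [enqueue("actions:jobs", f"job-{i}") for i in range(3)]
claimed = [dequeue("actions:jobs") for _ in range(3)]
backlog = len(backends["redis-a"])
```

Every enqueue reports `"ok"`, every dequeue returns `None`, and the backlog on `redis-a` grows. From the worker's point of view, an empty queue on the wrong backend looks exactly like a quiet period.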
Why 95% and Not 100%?
The 5% that continued working is telling. A partial misconfiguration suggests the routing error affected a subset of the cluster, likely specific nodes or key ranges. Workers that happened to establish connections through the unaffected routing path continued functioning normally. The rest stalled.
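One way a partial misconfiguration could produce a number like 95% is sketched below. This is speculation in code form: the count of 20 routing slots, the choice of which slots are broken, and the modulo assignment of workers to slots are all assumptions made up for the arithmetic, not details from the incident.

```python
SLOTS = 20

# Assumption for illustration: the bad change rewired 19 of the 20 slots.
misrouted_slots = set(range(1, SLOTS))


def worker_slot(worker_id):
    """Hypothetical assignment of a worker connection to a routing slot."""
    return worker_id % SLOTS


def can_claim_work(worker_id):
    return worker_slot(worker_id) not in misrouted_slots


workers = range(1000)
healthy = sum(can_claim_work(w) for w in workers)
stalled_fraction = 1 - healthy / len(workers)
```

With these made-up numbers, 50 of 1,000 workers land on the one intact slot and the remaining 95% stall. The point is only that a fractional outage falls out naturally when a fault follows routing boundaries.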
The Detection Problem
The failure was invisible at the trigger boundary. Standard API error rate monitoring wouldn't catch it because the API was returning 200. What eventually surfaced the incident was queue depth growing without corresponding job completions, a signal that requires end-to-end observability rather than just endpoint health checks.
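The signal described above, depth rising while completions stay flat, is simple to express. The following is a rough sketch of such a check, assuming you already collect periodic samples of queue depth and a cumulative completion counter; the `Sample` type, the `is_stalled` name, and the three-sample window are illustrative choices, not anything GitHub has described.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    queue_depth: int      # jobs waiting at sample time
    completed_total: int  # cumulative count of completed jobs


def is_stalled(samples, min_samples=3):
    """True if depth rose at every step of the recent window while the
    completion counter did not advance at all."""
    if len(samples) < min_samples:
        return False
    window = samples[-min_samples:]
    depth_growing = all(
        later.queue_depth > earlier.queue_depth
        for earlier, later in zip(window, window[1:])
    )
    no_completions = window[-1].completed_total == window[0].completed_total
    return depth_growing and no_completions


busy_but_healthy = [Sample(10, 100), Sample(14, 120), Sample(19, 141)]
silently_stuck = [Sample(10, 100), Sample(40, 100), Sample(85, 100)]
```

A growing queue alone is not alarming, since that is what a traffic spike looks like. It is the combination with a flat completion counter that separates "busy" from "stuck".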
Werner Vogels wrote years ago that "everything fails, all the time." The corollary is that distributed systems tend to fail along the lines of their topology: one routing path, one subset of nodes. Not everything fails. Just enough fails to break the system without making it obvious where.
What It Cost
The incident metrics show 2h 55m of 95% CI failure. What they don't show: engineers who assumed their pipeline was broken and cancelled runs, release pipelines that stalled waiting for green builds, time spent debugging locally before anyone checked GitHub's status page. These costs are real and invisible.
What Changed Afterward
GitHub rolled back the configuration and implemented improved automation for future Redis infrastructure changes. The automation point matters more than the rollback. Configuration changes to distributed infrastructure should be staged, validated end-to-end against real queue behavior, and automatically reverted if key metrics deviate within the first few minutes.
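A guarded rollout of that kind might look roughly like this. It is a sketch under stated assumptions: `apply_with_guard`, its parameters, and the 50% drop threshold are hypothetical, and a real version would wait between checks and watch more than one metric.

```python
def apply_with_guard(apply, revert, read_completion_rate, baseline,
                     checks=5, max_drop=0.5):
    """Apply a change, then revert it if the job completion rate falls
    more than max_drop below the baseline during the watch window."""
    apply()
    for _ in range(checks):
        # A real guard would sleep between checks; omitted to keep this short.
        rate = read_completion_rate()
        if rate < baseline * (1 - max_drop):
            revert()
            return "reverted"
    return "kept"


# Toy system state standing in for the real infrastructure.
state = {"config": "old", "rate": 100.0}


def apply_bad_change():
    state["config"] = "new"
    state["rate"] = 5.0  # completions collapse after the change


def revert_change():
    state["config"] = "old"
    state["rate"] = 100.0


outcome = apply_with_guard(
    apply_bad_change, revert_change, lambda: state["rate"], baseline=100.0
)
```

The metric being watched is job completions, not API success. A guard keyed on the trigger API's error rate would have kept this change.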
The post-incident note flagged that the staging environment lacked sufficient mock data to surface the failure. A staging environment that doesn't reproduce the data and topology of production is a confidence-generating illusion. It makes you feel safe without making you safe.
Lesson from the architecture: queue-based systems have a specific failure mode where the enqueue and dequeue paths are fully decoupled. A healthy enqueue does not imply a healthy dequeue. Your monitoring needs to observe both sides, not just the one your users interact with directly.
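One common way to observe both sides at once is a synthetic canary: push a marker job through the front door and confirm it comes out the back. The sketch below assumes you can supply two callables, one that enqueues and one that reports whether a token was processed; `canary_probe` and its parameters are hypothetical names.

```python
import time
import uuid
from collections import deque


def canary_probe(enqueue, was_completed, timeout_s=1.0, poll_s=0.01):
    """Enqueue a unique canary token and report whether the dequeue side
    processed it before the deadline."""
    token = f"canary-{uuid.uuid4()}"
    enqueue(token)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if was_completed(token):
            return True
        time.sleep(poll_s)
    return False


completed = set()   # tokens a worker has finished
orphaned = deque()  # a queue that no worker is reading

# Healthy path: the job reaches a worker, simulated here by marking the
# token completed as soon as it is enqueued.
healthy = canary_probe(completed.add, lambda t: t in completed, timeout_s=0.05)

# Broken path: the enqueue succeeds, but nothing ever consumes the job.
broken = canary_probe(orphaned.appendleft, lambda t: t in completed, timeout_s=0.05)
```

In the broken case the enqueue call succeeds and the probe still fails, which is exactly the distinction endpoint health checks cannot make.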
Build the system yourself
Reading about failures is useful. Understanding why they happen means building these systems yourself and experiencing the failure modes directly.