Post-Mortem · SEV1 · 1h 32m

How a Runner Pool Miscalculation Stalled GitHub Actions for 90 Minutes

A capacity planning miscalculation in GitHub's hosted runner fleet caused 78% of queued workflow jobs to wait indefinitely during peak hours, with the queue growing faster than the autoscaling system could provision new runners.

System: GitHub Actions

Date: April 22, 2026

Duration: 1h 32m

What GitHub Actions Does With Your Job

When a workflow triggers, GitHub's scheduling system places your job in a regional work queue. An assignment service continuously matches queued jobs against available runners. A runner can be a GitHub-hosted machine (ubuntu-22.04, windows-2022, macos-14) or a self-hosted runner you operate yourself.

For hosted runners, the lifecycle is tightly orchestrated. GitHub pre-provisions a pool of warm runner instances — machines that have already booted, configured their environment, and registered with the assignment service. When your job is assigned, it lands on a warm runner and starts in seconds.

The warm pool is not infinite. GitHub targets a pool size based on a rolling forecast of demand. When actual demand exceeds the warm pool, new runners must be provisioned from cold infrastructure — booting a virtual machine, installing the Actions agent, downloading the runner image. That process takes between 2 and 4 minutes depending on region and machine size.

The warm pool is why GitHub Actions feels fast. It is also why, when the pool empties faster than cold provisioning can refill it, the system stalls.

The Forecast Was Three Weeks Old

GitHub's runner pool sizing uses a demand forecast generated weekly from the previous 30 days of job volume. The April 22 incident happened on a Tuesday afternoon — a peak hour for North American engineers — following three consecutive weeks of above-average traffic.

A feature announcement from a large developer tools company that morning had driven an unusual spike in CI activity. Engineers were updating dependencies, testing compatibility, and pushing releases before the end of their working day. The job arrival rate at 14:07 UTC was approximately 2.3 times the projected hourly average.

The warm pool in the us-east region was sized for 1.4 times the projected peak — a standard safety margin for random variance. A 2.3 times spike exhausted the margin in under four minutes.
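To make the margin concrete, here is a small back-of-the-envelope calculation. It assumes both multipliers are measured against the same baseline, which the incident description does not state explicitly, and the function name is purely illustrative.

```python
def overshoot_ratio(spike_multiplier: float, safety_margin: float) -> float:
    """How far demand exceeds what the warm pool was sized for.

    1.0 means demand exactly matches the sized capacity. Assumes both
    multipliers share one baseline (an assumption, not stated in the source).
    """
    return spike_multiplier / safety_margin


ratio = overshoot_ratio(spike_multiplier=2.3, safety_margin=1.4)
print(f"demand is {ratio:.2f}x the capacity the pool was sized for")
```

Under that assumption, demand ran at roughly 1.64 times the capacity the pool was built to absorb, so the safety margin was not merely used up but exceeded by well over half.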

The autoscaling controller saw the queue depth rising at 14:09 and began issuing provisioning requests to GitHub's compute layer. Cold runner provisioning takes 2 to 4 minutes. Those runners started coming online at 14:11 but were immediately consumed by the backlog that had accumulated since 14:07. The queue grew faster than provisioning could drain it.

This is a textbook provisioning lag problem. The controller is reactive. It responds to what it observes. By the time the observation is made, acted on, and the result appears, conditions have changed. At 2.3 times the expected load, the lag was catastrophic.
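The effect of lag on a reactive controller can be illustrated with a toy simulation. This is a minimal sketch, not GitHub's controller: the numbers are invented (loosely scaled from the 2.3 and 1.4 multipliers), capacity is expressed as "jobs started per minute", and the request policy is a deliberately naive assumption.

```python
from collections import deque


def simulate(arrivals, base_capacity, lag_min, max_new_per_min, minutes):
    """Return queue depth per minute for a purely reactive controller.

    All quantities are in jobs per minute; one unit of capacity starts one
    job per minute. These are illustrative units, not GitHub's.
    """
    queue = 0
    capacity = base_capacity
    in_flight = deque([0] * lag_min)  # requested capacity not yet online
    depths = []
    for _ in range(minutes):
        queue = max(0, queue + arrivals - capacity)
        # React only to what is observed now: request capacity for the
        # visible backlog, limited by the per-minute provisioning cap.
        in_flight.append(min(max_new_per_min, queue))
        # Whatever was requested lag_min minutes ago comes online now.
        capacity += in_flight.popleft()
        depths.append(queue)
    return depths


no_lag = simulate(arrivals=230, base_capacity=140, lag_min=0,
                  max_new_per_min=20, minutes=60)
with_lag = simulate(arrivals=230, base_capacity=140, lag_min=3,
                    max_new_per_min=20, minutes=60)
print("peak backlog, no lag:   ", max(no_lag))
print("peak backlog, 3 min lag:", max(with_lag))
```

The controller, the cap, and the load are identical in both runs. Only the provisioning lag differs, and the lagged run accumulates a noticeably larger backlog before capacity catches up.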

The Autoscaler Was Set Too Conservatively

GitHub's autoscaling controller has two parameters relevant to this incident: the scale-up threshold (how long the queue must sit at a given depth before new runners are requested) and the maximum provisioning rate (how many new runners can be requested per minute).

The scale-up threshold was set to 45 seconds. At the observed arrival rate, that meant roughly 1,200 additional jobs entered the queue during each observation window before a scale-up decision was made. The maximum provisioning rate was set to 200 runners per minute, a conservative cap intended to avoid hammering the underlying compute APIs.

Neither parameter was wrong for normal conditions. Together, under 2.3 times normal load, they created a gap that took 88 minutes to close.
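The two figures above imply an arrival rate that can be worked out directly. The sketch below uses only the numbers quoted in this section; the comparison at the end is a rough one, since it ignores runners that finish a job and become available again.

```python
SCALE_UP_THRESHOLD_S = 45      # scale-up threshold, from the text
JOBS_PER_WINDOW = 1_200        # "roughly 1,200" jobs per observation window
MAX_PROVISION_PER_MIN = 200    # provisioning cap, from the text

arrival_per_sec = JOBS_PER_WINDOW / SCALE_UP_THRESHOLD_S
arrival_per_min = arrival_per_sec * 60
ratio = arrival_per_min / MAX_PROVISION_PER_MIN

print(f"implied arrival rate: {arrival_per_sec:.1f} jobs/s "
      f"({arrival_per_min:.0f} jobs/min)")
print(f"jobs arriving per newly provisioned runner: about {ratio:.0f}")
```

At roughly 1,600 jobs per minute against a ceiling of 200 new runners per minute, each scale-up decision was working against about eight arriving jobs per runner it was allowed to request.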

The decision to use conservative defaults traces to an incident three months prior where an aggressive autoscaler triggered enough provisioning requests to exhaust API quotas on the underlying virtual machine platform, causing a different class of failure. The fix for that incident made this one worse.

Why the Status Page Lagged Behind Reality

There is a detail in the timeline worth sitting with. The queue started accumulating at 14:07. The status page showed "Degraded Performance" at 14:23 — sixteen minutes later.

Status page updates at GitHub require two things: a monitoring alert to fire, and an on-call engineer to assess and post an update. The monitoring alert fired at 14:14 when the queue depth exceeded a threshold that triggered a PagerDuty page. The engineer who received it acknowledged at 14:19 and updated the status page at 14:23. Total human-in-the-loop delay: nine minutes from alert to status page.
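The timeline breaks down into three stages, which the following sketch computes from the timestamps quoted above. The stage labels and helper name are illustrative.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps (UTC) taken from the incident description.
queue_starts, alert_fires, acknowledged, status_page = (
    "14:07", "14:14", "14:19", "14:23")

detection = minutes_between(queue_starts, alert_fires)    # monitoring delay
acknowledge = minutes_between(alert_fires, acknowledged)  # paging delay
publish = minutes_between(acknowledged, status_page)      # assessment delay

print(f"detection {detection} min, acknowledge {acknowledge} min, "
      f"publish {publish} min, total {detection + acknowledge + publish} min")
```

Seven of the sixteen minutes passed before any alert fired, which is the portion a better alerting signal can shorten. The remaining nine are the human steps.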

For engineers who triggered workflows at 14:08, the problem was invisible for the next fifteen minutes. They saw a workflow in the "Queued" state, found nothing on the status page, and assumed their code was the problem.

Monitoring that fires on queue depth rather than queue wait time introduces this delay. A job that has been queued for ten seconds and a job that has been queued for ten minutes look identical on a depth graph until you look at age distribution. GitHub added a p95 queue wait time alert as a remediation item from this incident.
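The difference between the two signals is easy to see with made-up data. In this sketch, both queues hold the same number of jobs, but the job ages (in seconds) are invented for illustration, and `p95` is a simple nearest-rank percentile rather than whatever GitHub's monitoring computes.

```python
def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


# Hypothetical job ages in seconds; both queues contain 100 jobs.
healthy_ages = [2, 4, 6, 8, 10] * 20
stalled_ages = [60, 180, 300, 450, 600] * 20

for name, ages in (("healthy", healthy_ages), ("stalled", stalled_ages)):
    print(f"{name}: depth={len(ages)} p95_wait={p95(ages)}s")
```

A depth graph shows 100 for both queues. Only the wait-time percentile separates a queue that is turning over in seconds from one where jobs have been sitting for ten minutes.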

What Changed Afterward

Three changes were documented in the incident follow-up.

The first is the most impactful: predictive autoscaling. Rather than reacting to observed queue depth, the new controller watches incoming job submission rates and begins provisioning proactively when the rate of increase exceeds a threshold — before the warm pool is exhausted. The provisioning lead time is treated as a parameter to plan around rather than a constant to accept.
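A minimal sketch of the idea follows. It is not the controller described in the follow-up: the function names, the linear projection, and the simplification that one runner absorbs one job per minute are all assumptions made for illustration.

```python
import math


def submission_growth(rates):
    """Average per-minute change across a window of rates (oldest first)."""
    if len(rates) < 2:
        return 0.0
    return (rates[-1] - rates[0]) / (len(rates) - 1)


def runners_to_preprovision(rates, lead_time_min, available, growth_threshold):
    """Runners to request now so capacity exists when the lead time elapses.

    Assumes (hypothetically) linear growth and one job per runner per minute.
    """
    growth = submission_growth(rates)
    if growth <= growth_threshold:
        return 0  # submissions are not accelerating enough to act on
    # Plan for demand at the moment new runners would actually come online.
    projected = rates[-1] + growth * lead_time_min
    return max(0, math.ceil(projected - available))


steady = runners_to_preprovision([100, 100, 100], lead_time_min=3,
                                 available=300, growth_threshold=10)
surging = runners_to_preprovision([100, 160, 220], lead_time_min=3,
                                  available=300, growth_threshold=10)
print("steady traffic:", steady, "| surging traffic:", surging)
```

In the surging case the current rate (220) is still below available capacity (300), so a depth-based controller would see nothing wrong. The projection three minutes ahead is what triggers the request.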

The second is a tiered provisioning cap. Instead of a flat 200-runner-per-minute ceiling, the new cap scales with the rate of queue growth. When queue growth is above 3 times baseline, the cap rises to 600 per minute. The underlying compute API rate limits are managed by a quota reservation system that now holds emergency capacity for exactly this scenario.
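As a rough sketch, the cap can be written as a function of queue growth relative to baseline. The source gives only two data points (the previous flat 200 and the 600 tier above 3 times baseline), so this version assumes 200 remains the default tier and models no intermediate steps.

```python
DEFAULT_CAP = 200        # previous flat ceiling, assumed to remain the floor
EMERGENCY_CAP = 600      # cap when queue growth exceeds 3x baseline
EMERGENCY_MULTIPLIER = 3


def provisioning_cap(queue_growth: float, baseline_growth: float) -> int:
    """Maximum runners that may be requested per minute (two-tier sketch)."""
    if baseline_growth <= 0:
        raise ValueError("baseline_growth must be positive")
    if queue_growth > EMERGENCY_MULTIPLIER * baseline_growth:
        return EMERGENCY_CAP
    return DEFAULT_CAP


print(provisioning_cap(queue_growth=120, baseline_growth=100))  # normal tier
print(provisioning_cap(queue_growth=350, baseline_growth=100))  # emergency tier
```

Keeping the lower cap as the default preserves the protection that motivated it, while the higher tier is reserved for the conditions where the flat cap did the most damage.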

The third is the queue wait time alert. A p95 queue wait time greater than 90 seconds now triggers a page. This alert fires before the queue depth alert does, because wait time starts climbing as soon as the warm pool runs short, while queue depth can still look normal at that point.

John Allspaw, who led engineering operations at Etsy and later co-founded Adaptive Capacity Labs, wrote that effective incident response requires recognizing that monitoring systems are themselves models of the world, and models fail at the edges. A monitoring system tuned for the average case will miss the incidents that matter most, which tend to be precisely the above-average cases.

The wait time metric is a better model of what users experience. Queue depth is a better model of server-side load. In this incident, they diverged. The wait time metric would have caught the divergence earlier.

Lesson from the architecture: autoscaling systems are control loops with lag. Lag is not a bug — it is inherent in any reactive system. The design question is whether the lag is acceptable under your worst-case traffic scenario, not just your average scenario. When you set autoscaling parameters, you are making an implicit assumption about your peak-to-average ratio. On April 22, that assumption was wrong.
