Post-Mortem | SEV1 | 1h 10m

Cache Expiration, Replica Lag, and How 40% of GitHub Went Dark

A caching bug introduced in a routine deployment caused cache recalculation to read from a lagging read replica. The result was a cascade of stale cache entries that produced incorrect responses for 40% of github.com requests and 43% of API calls.

System: GitHub.com / GitHub API

Date: March 3, 2026

Duration: 1h 10m

The Architecture Behind the Bug

At GitHub's scale, direct reads against a single primary database aren't viable for most request types. The standard pattern is to maintain read replicas that asynchronously receive writes from the primary and handle the bulk of read traffic. Writes go to the primary. Reads go to replicas.

Cache hit: serve directly. Cache miss: read replica → populate → serve. Normally fine.

This works well almost all the time. Replication lag is measured in milliseconds under normal load. The replica is close enough to current that the difference doesn't matter for most reads.
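
The hit/miss path described above can be sketched as a small cache-aside wrapper. This is a minimal illustration, not GitHub's implementation: the class name `ReadThroughCache`, the `read_replica` callable, and the injectable `clock` are all hypothetical names introduced for the example.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Cache-aside read path with a per-entry TTL (hypothetical sketch)."""

    def __init__(self, read_replica: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._read_replica = read_replica
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (value, expires_at)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> str:
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]                       # cache hit: serve directly
        value = self._read_replica(key)           # miss or expired: read replica
        self._entries[key] = (value, self._clock() + self._ttl)  # populate
        return value                              # serve
```

Note that whatever the replica returns at rebuild time is what the cache serves for the whole TTL, which is why the freshness of that single read matters so much.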

The deployment on March 3 introduced a bug in the cache recalculation path. When a cache entry expired and needed to be rebuilt, the code read from the replica. Under the load of production traffic, the replica was behind the primary by enough that the recalculated cache entry contained stale user data, missing recent writes. The stale entry was then stored in the cache and served to subsequent requests.

What "Stale Cache Entry" Means in Practice

This wasn't data from five minutes ago being served. Even a few seconds of lag in the wrong context breaks things. Consider a user who just changed their repository permissions. The primary has the new state. The replica doesn't, not yet. A cache miss for that user's permission data hits the replica, gets the old state, caches it, and now every request for the next TTL interval serves the wrong permissions.
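
The permission scenario above can be reproduced in a few lines. The sketch below models the primary, replica, and cache as plain dictionaries; the key `alice:repo`, the permission values, and the `replicate()` helper are assumptions made up for illustration, and the TTL is omitted so the entry simply stays cached.

```python
from typing import Dict

primary: Dict[str, str] = {"alice:repo": "admin"}
replica: Dict[str, str] = dict(primary)   # replica starts in sync
cache: Dict[str, str] = {}                # TTL omitted: entries stay until evicted


def write_permission(key: str, value: str) -> None:
    """Writes go to the primary only; the replica catches up later."""
    primary[key] = value


def replicate() -> None:
    """Stand-in for asynchronous replication applying pending writes."""
    replica.update(primary)


def get_permission(key: str) -> str:
    """Cache hit serves directly; a miss rebuilds the entry from the replica."""
    if key not in cache:
        cache[key] = replica[key]
    return cache[key]


write_permission("alice:repo", "read")       # permission downgraded on the primary
stale = get_permission("alice:repo")         # miss: replica still says "admin"
replicate()                                  # the replica is now correct...
still_stale = get_permission("alice:repo")   # ...but the cached entry is not
```

The important detail is the last line: the replica catching up does not repair the cache. The wrong value persists until the entry expires or is invalidated.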

Why Testing Didn't Catch This

This is a timing-dependent failure. In a staging environment, replication lag is effectively zero because write throughput is minimal. The test suite cannot replicate production write volume, so it never creates the conditions under which lag accumulates enough to matter.
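
One way to see why an identical check passes in staging and fails in production is to make the lag an explicit parameter. The following is a toy model under stated assumptions: `LaggingReplica`, its logical timestamps, and the 0.5-second rebuild delay are hypothetical, and writes are assumed to be recorded in time order.

```python
from typing import List, Tuple


class LaggingReplica:
    """Replica that only sees writes older than lag_seconds (toy model)."""

    def __init__(self, lag_seconds: float) -> None:
        self.lag_seconds = lag_seconds
        self._log: List[Tuple[float, str, str]] = []  # (write_time, key, value)

    def record_write(self, at: float, key: str, value: str) -> None:
        self._log.append((at, key, value))

    def read(self, key: str, now: float) -> str:
        # Only writes that have had lag_seconds to propagate are visible.
        visible = [v for (t, k, v) in self._log
                   if k == key and t + self.lag_seconds <= now]
        return visible[-1]  # latest write the replica has applied


def read_after_write_is_fresh(lag_seconds: float) -> bool:
    """Write a new value, then rebuild from the replica 0.5s later."""
    replica = LaggingReplica(lag_seconds)
    replica.record_write(at=0.0, key="perm", value="admin")
    replica.record_write(at=10.0, key="perm", value="read")
    rebuilt = replica.read("perm", now=10.5)
    return rebuilt == "read"
```

With `lag_seconds=0.0` (the staging condition) the check returns `True`; with `lag_seconds=2.0` it returns `False`. A test suite that never varies the lag only ever exercises the first case.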

The bug was dormant in staging and alive in production from the moment the deployment landed.

The 43% API Number

The API failure rate was slightly higher than the general github.com rate. API callers tend to be more intensive users of the platform: CI systems, IDE integrations, deployment tools, bots. These access patterns correlate heavily with the permission and authentication data the cache was incorrectly serving. A CI token checking repository access and an IDE checking branch permissions both hit exactly the stale data path.

What Changed

Two things: rolling back the faulty deployment, and adding killswitches to monitoring. The rollback is obvious. The killswitch is the more interesting engineering decision.

A killswitch in a caching system lets engineers disable a specific caching code path instantly without doing a full deployment rollback. When the cache recalculation path is producing bad data, the safest option is to bypass it entirely, route reads to the primary, and accept the load increase as a controlled trade-off. Higher database load is a better problem to have than incorrect data served at scale.
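
A killswitch of this kind might look like the sketch below. It is an assumption-laden illustration rather than the mechanism GitHub deployed: `KillSwitch` and `GuardedCache` are hypothetical names, and the flag is held in memory here, whereas a production system would presumably read it from a runtime flag store so it can be flipped without a deploy.

```python
from typing import Callable, Dict


class KillSwitch:
    """In-memory flag (assumption: real systems back this with a flag store)."""

    def __init__(self) -> None:
        self.engaged = False


class GuardedCache:
    """Cache-aside reads that can be bypassed at runtime (hypothetical sketch)."""

    def __init__(self, read_primary: Callable[[str], str],
                 read_replica: Callable[[str], str],
                 switch: KillSwitch) -> None:
        self._read_primary = read_primary
        self._read_replica = read_replica
        self._switch = switch
        self._entries: Dict[str, str] = {}

    def get(self, key: str) -> str:
        if self._switch.engaged:
            # Bypass the cache entirely: no cached read, no repopulation.
            return self._read_primary(key)
        if key not in self._entries:
            self._entries[key] = self._read_replica(key)  # normal miss path
        return self._entries[key]
```

In this sketch, engaging the switch does not clear entries that were already poisoned, so they would need to be flushed before the switch is released.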

The correct fix at the architecture level is to route cache recalculation reads to the primary rather than the replica. This costs more in database load for cache misses but eliminates the lag coupling entirely. For data where correctness is more important than read throughput (user permissions, auth context, session state), this is the right default.
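
Routing rebuilds by data class could be expressed as a small policy function. The key prefixes below are invented for the example and mirror the categories named above (permissions, auth context, session state); how real keys are named is not something the incident report specifies.

```python
from typing import Callable, Dict

# Hypothetical key prefixes for data where correctness outranks read throughput.
PRIMARY_REBUILD_PREFIXES = ("permissions:", "auth:", "session:")


def rebuild_source(key: str) -> str:
    """Decide which database a cache rebuild for this key should read from."""
    return "primary" if key.startswith(PRIMARY_REBUILD_PREFIXES) else "replica"


def get(key: str, cache: Dict[str, str],
        read_primary: Callable[[str], str],
        read_replica: Callable[[str], str]) -> str:
    """Cache-aside read whose miss path depends on the key's data class."""
    if key in cache:
        return cache[key]
    reader = read_primary if rebuild_source(key) == "primary" else read_replica
    cache[key] = reader(key)
    return cache[key]
```

Only cache misses for the listed classes reach the primary, so the extra load scales with the miss rate for those keys rather than with total read traffic.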

The Pattern

Cache miss handlers that read from replicas are latent bugs waiting for a write spike. Your cache's correctness is coupled to your replication lag. That coupling is invisible during normal operation and catastrophic when it surfaces.

The right rule of thumb: if you would be uncomfortable serving stale data from that cache for 30 seconds, route its recalculation reads to the primary. The performance cost is bounded. The cost of serving wrong data to 40% of your users, at scale, is not.

