Post-Mortem · SEV1 · 1h 10m

Cache Expiration, Replica Lag, and How 40% of GitHub Went Dark

A caching bug introduced in a routine deployment caused cache recalculation to read from a lagging read replica. The result was a cascade of stale cached data serving incorrect responses to 40% of github.com requests and 43% of API calls.

System: GitHub.com / GitHub API
Date: March 3, 2026
Duration: 1h 10m
Tags: caching, replication-lag, read-replica, cache-invalidation, github

Phil Karlton's observation that "there are only two hard things in computer science: cache invalidation and naming things" has been quoted so often it's become wallpaper. Most engineers hear it and think about TTL tuning or stale-while-revalidate headers. The harder version of the problem is what happens when cache recalculation itself produces the wrong answer, not because the cache logic is wrong, but because the data source it reads from is behind.

That is what happened on March 3, 2026. A deployment to GitHub introduced a bug in the caching mechanism that caused cache entries to be recalculated using data from a read replica that was behind the primary database. The freshly recalculated entries were wrong. They got served. Roughly 40% of github.com requests and 43% of GitHub API calls failed for one hour and ten minutes.

The Architecture Behind the Bug

At GitHub's scale, direct reads against a single primary database aren't viable for most request types. The standard pattern is to maintain read replicas that asynchronously receive writes from the primary and handle the bulk of read traffic. Writes go to the primary. Reads go to replicas.

[Figure: Normal operation with primary, replica, and cache. All writes land on the primary DB, which is always current. Async replication (~5–50 ms lag) feeds the read replica, which handles bulk reads and may lag the primary. Cache misses read from the replica. The cache layer (user data, permissions, session context, TTL-based expiry) serves users and API clients.]

Cache hit: serve directly. Cache miss: read replica → populate → serve. Normally fine.
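
The hit and miss paths described above can be sketched in a few lines. This is a minimal illustration of a TTL cache with a read-through helper, not GitHub's implementation; `TTLCache`, `read_through`, and the `load_from_replica` callback are hypothetical names, and the injectable clock exists only to make the expiry behavior easy to test.

```python
import time
from typing import Any, Callable, Dict, Tuple

_MISS = object()  # sentinel so that a cached None is not mistaken for a miss


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        entry = self._entries.get(key)
        if entry is None:
            return _MISS
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Entry has outlived its TTL: drop it and report a miss.
            del self._entries[key]
            return _MISS
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self.clock() + self.ttl, value)


def read_through(cache: TTLCache, key: str,
                 load_from_replica: Callable[[str], Any]) -> Any:
    """Cache hit: serve directly. Cache miss: read replica, populate, serve."""
    value = cache.get(key)
    if value is not _MISS:
        return value
    value = load_from_replica(key)  # assumed to query a read replica
    cache.set(key, value)
    return value
```

Note that nothing in `read_through` checks how current the replica is. Whatever the loader returns is trusted for a full TTL.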

This works well almost all the time. Replication lag is measured in milliseconds under normal load. The replica is close enough to current that the difference doesn't matter for most reads.

The deployment on March 3 introduced a bug in the cache recalculation path. When a cache entry expired and needed to be rebuilt, the code read from the replica. Under the load of production traffic, the replica was behind the primary by enough that the recalculated cache entry contained stale user data, missing recent writes. The stale entry was then stored in the cache and served to subsequent requests.

What "Stale Cache Entry" Means in Practice

This wasn't data from five minutes ago being served. Even a few seconds of lag in the wrong context breaks things. Consider a user who just changed their repository permissions. The primary has the new state. The replica doesn't, not yet. A cache miss for that user's permission data hits the replica, gets the old state, caches it, and now every request for the next TTL interval serves the wrong permissions.

[Figure: The bug, a stale replica read on cache miss. (1) A user write sets new permissions and the primary DB is updated. (2) The read replica is seconds behind and still holds the old data, which is normal under write pressure. (3) The TTL expires, the cache miss triggers recalculation, and the replica returns stale data. (4) The stale entry is written to the cache. (5) All requests are served the stale entry until the TTL expires again. Impact: 40% of github.com requests and 43% of GitHub API calls failed for 1 hour 10 minutes, with wrong permissions served, auth failures, and session data inconsistencies.]
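
The sequence can be reproduced with a small simulation on a logical clock. Everything here is a sketch under stated assumptions: `Primary`, `LaggingReplica`, and `run_scenario` are hypothetical names, and the 5-second lag, 60-second TTL, and timestamps are made-up values chosen to make the effect visible, not figures from the incident.

```python
class Primary:
    """Holds the authoritative state plus a commit log of (time, key, value)."""

    def __init__(self):
        self.log = []

    def write(self, now, key, value):
        self.log.append((now, key, value))


class LaggingReplica:
    """Sees a primary write only once `lag` seconds have passed since commit."""

    def __init__(self, primary, lag):
        self.primary = primary
        self.lag = lag

    def read(self, now, key):
        value = None
        for committed_at, k, v in self.primary.log:
            if k == key and committed_at + self.lag <= now:
                value = v  # the latest write that has replicated wins
        return value


def run_scenario(lag=5.0, ttl=60.0):
    primary = Primary()
    replica = LaggingReplica(primary, lag)
    cache = {}  # key -> (expires_at, value)

    def cached_read(now, key):
        entry = cache.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]
        value = replica.read(now, key)  # the bug: rebuild from the replica
        cache[key] = (now + ttl, value)
        return value

    primary.write(0.0, "perm:alice", "write")
    primary.write(100.0, "perm:alice", "read-only")     # step 1: user write

    served_at_101 = cached_read(101.0, "perm:alice")    # steps 2-4: stale rebuild
    served_at_130 = cached_read(130.0, "perm:alice")    # step 5: still stale
    replica_at_130 = replica.read(130.0, "perm:alice")  # replica caught up long ago
    served_at_162 = cached_read(162.0, "perm:alice")    # TTL expired, rebuilt
    return served_at_101, served_at_130, replica_at_130, served_at_162
```

In this sketch, the replica catches up 5 seconds after the write, but the cache keeps serving the old permission until the entry expires a full TTL after the bad rebuild. A brief lag at rebuild time is amplified into a minute of wrong answers.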

Why Testing Didn't Catch This

This is a timing-dependent failure. In a staging environment, replication lag is effectively zero because write throughput is minimal. The test suite cannot replicate production write volume, so it never creates the conditions under which lag accumulates enough to matter.

The bug was dormant in staging and alive in production from the moment the deployment landed.
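
One way to make this class of bug visible before production is to inject replication lag into a test double instead of waiting for real write volume to create it. The sketch below assumes a hypothetical `FakeStore` with a configurable lag measured in logical ticks, and a `rebuild_from_replica` function standing in for the buggy recalculation path; none of these names come from GitHub's codebase.

```python
class FakeStore:
    """Primary plus a replica view that trails it by `lag` logical ticks."""

    def __init__(self, lag):
        self.lag = lag
        self.now = 0
        self.history = []  # (tick, key, value)

    def write(self, key, value):
        self.now += 1
        self.history.append((self.now, key, value))

    def _read(self, key, as_of):
        value = None
        for tick, k, v in self.history:
            if k == key and tick <= as_of:
                value = v
        return value

    def read_primary(self, key):
        return self._read(key, self.now)

    def read_replica(self, key):
        # The replica only reflects writes up to `lag` ticks in the past.
        return self._read(key, self.now - self.lag)


def rebuild_from_replica(store, key):
    """Stand-in for the buggy cache recalculation path."""
    return store.read_replica(key)


def rebuild_is_fresh(rebuild, lag):
    """True if a rebuild right after a write matches the primary's state."""
    store = FakeStore(lag)
    store.write("perm:alice", "write")
    store.write("perm:alice", "read-only")
    return rebuild(store, "perm:alice") == store.read_primary("perm:alice")
```

With `lag=0`, which is what staging effectively provides, `rebuild_is_fresh(rebuild_from_replica, 0)` passes. With `lag=1` the same check fails, which is the production condition the test suite never exercised.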

The 43% API Number

The API failure rate was slightly higher than the general github.com rate. API callers tend to be more intensive users of the platform: CI systems, IDE integrations, deployment tools, bots. These access patterns correlate heavily with the permission and authentication data the cache was incorrectly serving. A CI token checking repository access and an IDE checking branch permissions both hit exactly the stale data path.

What Changed

Two things: rolling back the faulty deployment, and adding killswitches to monitoring. The rollback is obvious. The killswitch is the more interesting engineering decision.

A killswitch in a caching system lets engineers disable a specific caching code path instantly without doing a full deployment rollback. When the cache recalculation path is producing bad data, the safest option is to bypass it entirely, route reads to the primary, and accept the load increase as a controlled trade-off. Higher database load is a better problem to have than incorrect data served at scale.
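
A minimal sketch of the idea, assuming an in-process flag: `KillSwitch` and `get_value` are hypothetical names, and in a real system the flag would more likely live in a feature-flag or configuration service so it can be flipped across a fleet without a deploy.

```python
import threading


class KillSwitch:
    """Thread-safe on/off flag (in-process stand-in for a remote flag)."""

    def __init__(self):
        self._tripped = threading.Event()

    def trip(self):
        self._tripped.set()

    def reset(self):
        self._tripped.clear()

    @property
    def tripped(self):
        return self._tripped.is_set()


def get_value(key, cache, read_primary, rebuild, switch):
    """Serve from cache normally; bypass cache and rebuild when tripped."""
    if switch.tripped:
        # Controlled trade-off: more primary load, but no stale answers.
        return read_primary(key)
    if key in cache:
        return cache[key]
    value = rebuild(key)
    cache[key] = value
    return value
```

When the switch is tripped, the cached entries are ignored as well as the rebuild path, since entries already written by the faulty code cannot be trusted either.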

The correct fix at the architecture level is to route cache recalculation reads to the primary rather than the replica. This costs more in database load for cache misses but eliminates the lag coupling entirely. For data where correctness is more important than read throughput (user permissions, auth context, session state), this is the right default.
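
As a rough sketch of that default, recalculation can pick its data source by data class. The key prefixes below are assumptions made up for illustration (the post names the data classes, not any key scheme), and `read_primary` and `read_replica` are hypothetical callbacks.

```python
# Hypothetical key prefixes for data where correctness beats read throughput.
CORRECTNESS_CRITICAL_PREFIXES = ("perm:", "auth:", "session:")


def pick_rebuild_source(key):
    """Return which database a cache rebuild for `key` should read from."""
    if key.startswith(CORRECTNESS_CRITICAL_PREFIXES):
        return "primary"
    return "replica"


def rebuild(key, read_primary, read_replica):
    """Recalculate a cache entry from the source chosen for its data class."""
    if pick_rebuild_source(key) == "primary":
        return read_primary(key)
    return read_replica(key)
```

Only cache misses pay the extra primary read. Cache hits, which are the bulk of traffic in a healthy cache, are unaffected.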

The Pattern

Cache miss handlers that read from replicas are latent bugs waiting for a write spike. Your cache's correctness is coupled to your replication lag. That coupling is invisible during normal operation and catastrophic when it surfaces.

The right rule of thumb: if you would be uncomfortable serving stale data from that cache for 30 seconds, route its recalculation reads to the primary. The performance cost is bounded. The cost of serving wrong data to 40% of your users, at scale, is not.
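
The rule of thumb can be written down as a per-cache policy check. This is a sketch under stated assumptions: `CachePolicy` and its fields are hypothetical, and the worst-case bound assumes a replica-built entry can be as old as the replication lag at rebuild time plus one full TTL.

```python
from dataclasses import dataclass

STALENESS_THRESHOLD_S = 30.0  # the 30 seconds from the rule of thumb above


@dataclass(frozen=True)
class CachePolicy:
    name: str
    tolerated_staleness_s: float  # hypothetical per-cache setting
    ttl_s: float


def rebuild_source(policy):
    """Route rebuilds to the primary when 30 s of staleness is unacceptable."""
    if policy.tolerated_staleness_s < STALENESS_THRESHOLD_S:
        return "primary"
    return "replica"


def worst_case_staleness_s(policy, replica_lag_s):
    """Upper bound on data age for an entry rebuilt from a lagging replica."""
    return replica_lag_s + policy.ttl_s
```

For example, `rebuild_source(CachePolicy("permissions", 0.0, 60.0))` selects the primary, while a cache of avatar URLs that tolerates an hour of staleness stays on the replica.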
