Post-Mortem · SEV1 · 2h 08m

The CDN Config Push That Made Docker Pulls Fail for Two Hours

A configuration push to GitHub's CDN layer introduced an incorrect cache-control directive that caused container image layer requests to return stale 404 responses, breaking docker pull for any image modified in the preceding 72 hours.

System: GitHub Packages / Container Registry

Date: April 28, 2026

Duration: 2h 08m

How Container Image Distribution Works

A container image is not a single file. It is a manifest — a JSON document listing a set of content-addressed layer blobs — and a collection of those blobs. When you run docker pull nginx:latest, the client makes two categories of requests to the registry.

First, it fetches the manifest by tag or digest. The manifest tells the client which layer blobs it needs. Second, for each layer the client does not already have locally, it fetches the blob by its content digest (a SHA-256 hash of the layer content). Both the manifest and the blob endpoints are served through GitHub's CDN layer before the origin registry handles them.
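The two request categories can be sketched in a few lines. This is a minimal illustration, not the Docker client's actual code: the manifest below is a trimmed-down, made-up example whose field names follow the OCI image manifest layout, and the helper names layers_to_fetch and blob_path are hypothetical.

```python
import json

# Hypothetical, trimmed-down manifest. Field names follow the OCI image
# manifest layout; the digests are placeholders, not real SHA-256 values.
MANIFEST_JSON = """
{
  "schemaVersion": 2,
  "layers": [
    {"digest": "sha256:aaa", "size": 100},
    {"digest": "sha256:bbb", "size": 200},
    {"digest": "sha256:ccc", "size": 300}
  ]
}
"""


def layers_to_fetch(manifest_json: str, local_digests: set) -> list:
    """Return digests of layers the client does not hold yet, in manifest order."""
    manifest = json.loads(manifest_json)
    return [
        layer["digest"]
        for layer in manifest["layers"]
        if layer["digest"] not in local_digests
    ]


def blob_path(repository: str, digest: str) -> str:
    """Blob request path in the registry HTTP API: /v2/<name>/blobs/<digest>."""
    return f"/v2/{repository}/blobs/{digest}"


# Step 1: the manifest says which layers exist. Step 2: fetch only the missing ones.
missing = layers_to_fetch(MANIFEST_JSON, local_digests={"sha256:aaa"})
paths = [blob_path("library/nginx", digest) for digest in missing]
```

Each entry in paths is a separate request that passes through the CDN, which is why one bad cache entry for one layer is enough to fail the whole pull.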

Content-addressed storage has a useful property: a blob identified by its SHA-256 digest is immutable. If you successfully fetch a blob once, you can cache it forever — the content will never change under the same hash. This is why CDN caching of layer blobs is safe and sensible.
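A short sketch of why a digest-addressed blob is safe to cache: the name is derived from the bytes, so any cached copy can be checked against the name it was requested under. The function names here are illustrative, not part of any registry client.

```python
import hashlib


def digest_of(content: bytes) -> str:
    """Compute the content digest in the 'sha256:<hex>' form used for layer blobs."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def verify_blob(expected_digest: str, content: bytes) -> bool:
    """A cached blob is valid if and only if its bytes still hash to its name."""
    return digest_of(content) == expected_digest


layer = b"example layer bytes"
layer_digest = digest_of(layer)

# The same bytes always verify; any modification produces a different digest.
unchanged_ok = verify_blob(layer_digest, layer)
tampered_ok = verify_blob(layer_digest, layer + b"x")
```

Note that this guarantee covers successful responses only. A 404 carries no bytes to hash, so nothing about it can be verified against the digest.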

What is not safe to cache is a 404 response for a blob. A 404 means the CDN asked origin for the blob and origin said it does not exist. That answer is only correct at the moment origin gives it. It may no longer be correct ten minutes later, once the upload has propagated. And it is categorically wrong once the blob is confirmed to exist on origin, which is the case for any image that was pushed successfully before the misconfiguration.

What the Config Change Actually Did

The configuration change that went out at 09:41 UTC was a YAML file updating CDN cache policies for GitHub Packages. The intent was to reduce manifest TTLs from 60 seconds to 15 seconds to improve freshness for users who tag and push new image versions frequently.

The change was reviewed by two engineers and passed a configuration linting step that checked for syntax errors. Neither the review nor the lint caught the substantive error: the new policy included a stanza that set cache_404_responses: true for the blob endpoint path pattern. This directive was copied from a different CDN service configuration for a static assets service, where caching 404s is appropriate — a missing static asset that was legitimately deleted should be cached to reduce load.

It was the wrong stanza in the wrong file. The configuration linter checked syntax only; it did not compare a stanza's semantics against the category of endpoint it was applied to.

The CDN cached the first 404 response it received for each new blob path — which happened because the CDN queried origin during the few-second window between the blob upload request arriving and the upload completing in object storage. Origin returned 404. CDN cached that 404 for 300 seconds. When the upload completed and origin would now return 200, the CDN did not know and did not ask — it had a fresh cached response.

This is not a bug in the CDN. It is the CDN behaving exactly as configured. The bug was telling it to cache 404s at all.
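The race can be replayed with a toy model. This is a simplified simulation under stated assumptions, not GitHub's CDN: the Origin and Cdn classes are hypothetical, time is passed in as plain seconds, and the same 300-second TTL is applied to every cached response for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Origin:
    """Toy origin registry: a blob exists only once its upload has completed."""
    blobs: set = field(default_factory=set)
    requests: int = 0

    def get(self, path: str) -> int:
        self.requests += 1
        return 200 if path in self.blobs else 404


@dataclass
class Cdn:
    """Toy edge cache keyed by path, with one TTL for all cached responses."""
    origin: Origin
    cache_404_responses: bool
    ttl_seconds: int = 300
    _cache: dict = field(default_factory=dict)  # path -> (status, expires_at)

    def get(self, path: str, now: int) -> int:
        entry = self._cache.get(path)
        if entry is not None and now < entry[1]:
            return entry[0]  # fresh cache hit: origin is not consulted
        status = self.origin.get(path)
        if status == 200 or (status == 404 and self.cache_404_responses):
            self._cache[path] = (status, now + self.ttl_seconds)
        else:
            self._cache.pop(path, None)
        return status


def replay(cache_404_responses: bool):
    """Request a blob during the upload window, then again after the upload lands."""
    origin = Origin()
    cdn = Cdn(origin, cache_404_responses)
    path = "/v2/acme/app/blobs/sha256:abc"  # hypothetical blob path
    statuses = [cdn.get(path, now=0)]        # upload still in flight
    origin.blobs.add(path)                   # upload completes
    statuses.append(cdn.get(path, now=10))   # inside the 300 s TTL
    statuses.append(cdn.get(path, now=301))  # after the TTL has expired
    return statuses, origin.requests


with_negative_caching = replay(cache_404_responses=True)
without_negative_caching = replay(cache_404_responses=False)
```

With the directive enabled, the request at ten seconds still receives the 404 captured during the upload window, even though origin would answer 200. With it disabled, the same request goes to origin and succeeds.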

The Detection Problem Was the Error Message

Engineers who hit broken pulls received errors like "Error response from daemon: pull access denied" or "manifest unknown: manifest unknown". These are the same errors you see when the image does not exist, your credentials are wrong, or you have a typo in the image name. Nothing in those messages suggests "the CDN has incorrect state about a layer that definitely exists."

The first signal that this was infrastructure-side rather than user-side came from an internal monitoring alert on GitHub's registry origin request rate. At 10:14 UTC, origin request volume dropped sharply — a counterintuitive signal. Normally a problem generates more origin requests as clients retry. Lower origin traffic meant the CDN was serving responses from cache without forwarding to origin. Combined with rising error rates in GitHub's synthetic monitoring (which tried to pull known-good images on a schedule), this was enough for an on-call engineer to begin investigation.
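The combination of signals described above can be written as a small predicate. This is a sketch only: the function name and both thresholds are assumptions chosen for illustration, not values from GitHub's alerting.

```python
def looks_like_cdn_side_failure(
    baseline_origin_rps: float,
    current_origin_rps: float,
    synthetic_error_rate: float,
    drop_threshold: float = 0.5,     # assumed: origin traffic fell by at least half
    error_threshold: float = 0.05,   # assumed: at least 5% of synthetic pulls fail
) -> bool:
    """Flag the pattern 'clients are failing, yet origin is seeing less traffic'."""
    if baseline_origin_rps <= 0:
        return False  # no baseline to compare against
    origin_drop = 1 - current_origin_rps / baseline_origin_rps
    return origin_drop >= drop_threshold and synthetic_error_rate >= error_threshold


# Errors up while origin traffic is down: the cache is answering on origin's behalf.
cached_errors = looks_like_cdn_side_failure(1000, 300, 0.20)

# Errors up while origin traffic is also up: the usual retry pattern.
retry_storm = looks_like_cdn_side_failure(1000, 1500, 0.20)

# Origin traffic down with no errors: possibly just a quiet period.
quiet_period = looks_like_cdn_side_failure(1000, 300, 0.0)
```

Neither signal is conclusive alone. A traffic drop without errors is normal variation, and errors with rising origin traffic point at origin rather than the cache.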

The status page update at 10:26 UTC — 45 minutes after the incident started — reflected how long it took to identify that the problem was CDN configuration rather than origin storage.

What Changed Afterward

The CDN configuration linter now includes semantic validation rules for endpoint categories. Blob endpoints and manifest endpoints are annotated in the configuration schema; the linter rejects cache_404_responses: true on any endpoint classified as mutable-upload-target.
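A semantic rule of this kind might look like the sketch below. The directive name cache_404_responses and the category mutable-upload-target come from the incident description; the EndpointPolicy type, the static-asset category name, and the path patterns are hypothetical stand-ins for the real configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointPolicy:
    path_pattern: str
    category: str               # e.g. "mutable-upload-target" or "static-asset"
    cache_404_responses: bool


def lint(policies):
    """Return one error message per policy that violates the semantic rule."""
    errors = []
    for policy in policies:
        if policy.category == "mutable-upload-target" and policy.cache_404_responses:
            errors.append(
                f"{policy.path_pattern}: cache_404_responses must not be true "
                f"for category {policy.category}"
            )
    return errors


policies = [
    # The stanza from the incident: negative caching on the blob endpoint.
    EndpointPolicy("/v2/*/blobs/*", "mutable-upload-target", True),
    # Where the stanza was copied from: negative caching is acceptable here.
    EndpointPolicy("/assets/*", "static-asset", True),
]
lint_errors = lint(policies)
```

A syntax-only linter accepts both stanzas because both are well-formed. The rule only becomes expressible once each endpoint carries a category annotation.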

The synthetic monitoring that caught the signal at 10:14 UTC was extended. It now runs pull tests for images that were pushed within the previous hour, specifically to catch the class of issue where recently uploaded content behaves differently from older content. The test runs every two minutes in five regions.
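Selecting targets for such a test could be as simple as filtering by push time. The sketch below assumes the monitor has access to a mapping from image name to last push timestamp; that mapping, the image names, and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recently_pushed(push_times, now, window=timedelta(hours=1)):
    """Return names of images pushed within `window` before `now`, sorted."""
    return sorted(
        name
        for name, pushed_at in push_times.items()
        if timedelta(0) <= now - pushed_at <= window
    )


now = datetime(2026, 4, 28, 12, 0, tzinfo=timezone.utc)
push_times = {
    "acme/api": now - timedelta(minutes=5),    # pushed recently: pull-test it
    "acme/web": now - timedelta(minutes=59),   # still inside the one-hour window
    "acme/base": now - timedelta(days=3),      # old content: covered by other tests
}
targets = recently_pushed(push_times, now)

# One test every two minutes in each of five regions.
runs_per_hour = (60 // 2) * 5
```

The point of the filter is that old images and fresh images take different paths through the cache, so a test that only pulls long-lived images would have stayed green on April 28.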

The configuration change process now includes a canary rollout step for CDN policy changes. Rather than deploying simultaneously to all CDN edge nodes, the new process deploys to 1% of traffic for five minutes, watches error rate and origin bypass rate metrics, and requires explicit approval before full rollout. The April 28 incident would have been limited to a five-minute window if this process had been in place.
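The gate at the end of the canary window can be sketched as a three-way decision. The type, the thresholds, and the reading of "origin bypass rate" as the share of requests answered from cache without contacting origin are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CanaryWindow:
    """Metrics gathered while the new policy serves 1% of traffic for five minutes."""
    baseline_error_rate: float
    canary_error_rate: float
    baseline_origin_bypass_rate: float
    canary_origin_bypass_rate: float
    approved_by_human: bool


def canary_decision(
    window: CanaryWindow,
    max_error_delta: float = 0.01,    # assumed tolerance
    max_bypass_delta: float = 0.10,   # assumed tolerance
) -> str:
    """Return 'rollback', 'hold', or 'promote' for a finished canary window."""
    if window.canary_error_rate - window.baseline_error_rate > max_error_delta:
        return "rollback"
    bypass_delta = window.canary_origin_bypass_rate - window.baseline_origin_bypass_rate
    if bypass_delta > max_bypass_delta:
        return "rollback"
    # Healthy metrics are necessary but not sufficient: approval is explicit.
    return "promote" if window.approved_by_human else "hold"


# Errors rise and origin is bypassed far more often, as with cached 404s.
incident_like = canary_decision(CanaryWindow(0.01, 0.08, 0.60, 0.95, True))
healthy_unapproved = canary_decision(CanaryWindow(0.01, 0.01, 0.60, 0.60, False))
healthy_approved = canary_decision(CanaryWindow(0.01, 0.01, 0.60, 0.60, True))
```

Human approval never overrides a failed metric check in this sketch; it only gates promotion when the metrics are already healthy.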

Phil Karlton's observation that cache invalidation is one of the two hard problems in computer science is usually cited with a wry smile. It is worth taking it seriously. Cache invalidation is hard because cached state is, by definition, disconnected from the authoritative state it represents. Every time you introduce a cache, you introduce the possibility that the cache and the origin disagree — and that clients cannot tell the difference from their side.

Caching error responses is a sharp edge of this problem. A cached 200 that becomes stale is tolerable: you served old content. A cached 404 that becomes stale is worse: you told a client something does not exist when it does. The CDN was confident and wrong. Confident wrong answers are harder to debug than uncertain ones, because they do not invite investigation.

Lesson from the architecture: negative results (404, cache miss confirmations) are often less safe to cache than positive ones, especially in systems where the absence of a resource is a transient state rather than a permanent one. An upload in progress looks like an absent resource from the outside. Caching that absence — and serving it after the upload completes — turns a race condition into a correctness bug that persists until the cache TTL expires.
