Skip to content

docker: add blob-level mirror fallback in dockerImageSource#845

Open
QiWang19 wants to merge 1 commit into
containers:mainfrom
QiWang19:fallback-mirror
Open

docker: add blob-level mirror fallback in dockerImageSource#845
QiWang19 wants to merge 1 commit into
containers:mainfrom
QiWang19:fallback-mirror

Conversation

@QiWang19
Copy link
Copy Markdown
Member

@QiWang19 QiWang19 commented May 14, 2026

Summary

In disconnected/restricted environments, a mirror can pass the manifest
check in newImageSource() but then fail blob downloads (5xx, timeout,
or partial mirror returning BLOB_UNKNOWN). When this happens, the copy
fails and CRI-O propagates the error to kubelet — which retries from
scratch, picks the same broken mirror, and deadlocks.

This PR adds blob-level mirror fallback inside dockerImageSource so
that when a blob fetch fails, remaining configured mirrors are tried
before giving up. The fallback is transparent to all callers (CRI-O,
Podman, Buildah, Skopeo).

Design choices

Fallback triggers: Only transient errors (5xx, network timeout) and
blob-not-found (BLOB_UNKNOWN) trigger fallback. Other 4xx errors (auth
failures, etc.) are fatal — they indicate a configuration problem, not a
mirror issue.

Serialized mirror probing: getBlobWithMirrorFallback() holds
mirrorMu for the entire probe loop. This is intentional: the first
goroutine to hit a failure walks the mirror list while the rest block on
getActiveSource(), then reuse the discovered working mirror via
mirrorOverride. This avoids N goroutines independently exhausting the
mirror list.

No manifest re-fetch on fallback: Blob digests from the original
manifest are used against fallback sources. Safe because digests are
content-addressed. newImageSourceAttempt() calls
ensureManifestIsLoaded() on each candidate, which filters out mirrors
that can't serve the manifest at all.

remainingSources includes mirrors AND the primary location: The
fallback loop tries whatever pullSources entries come after the
initially-selected source, which may include the original upstream
registry.

Same-mirror retry on transient errors: tryGetBlob() retries once
with 1s+jitter delay before moving to the next source. This handles
brief 503 spikes without immediately triggering the full fallback
machinery.

Questions

  • Are the right errors triggering fallback in isMirrorTransientError() / isMirrorFallbackError()? Should any other error types be included or excluded?
  • The full-loop lock in getBlobWithMirrorFallback() serializes all blob goroutines during fallback — is this acceptable, or would you prefer a different concurrency strategy?
  • Should tryGetBlob() retry once on the same mirror before falling back, or should any failure immediately try the next source?

Fixes: https://issues.redhat.com/browse/OCPSTRAT-3170

@github-actions github-actions Bot added the image Related to "image" package label May 14, 2026
Copy link
Copy Markdown
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Doing this at the copy.Image level is very surprising, and harder than necessary — and less resilient, because it still assumes that a full image copy can proceed uninterrupted from a single mirror.

I’d expect dockerImageSource, instead, to do the fallbacks — making physicalRef and the dockerClient uses mutable (careful about HasThreadSafeGetBlob: true!), or perhaps maintaining an on-demand-extended set of dockerClients, in the worst case one per mirror; then every GetBlob/… on the dockerImageSource could fall back to another mirror (and we can have endless tuning discussions about the tuning logic there)…

Either way the generic copy code, AFAICS, does not need to know.

Comment thread image/types/types.go Outdated
Comment on lines +682 to +685
// If not nil, DockerExcludedPullSources lists endpoint Location that NewImageSource skips when selecting a pull source.
DockerExcludedPullSources map[string]struct{}
// If DockerExcludedPullSources is not nil, the Location of the pull source selected by the most recent NewImageSource call.
DockerLastSelectedPullSource string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is conceptually an internal state of dockerImageSource and I see no reason to make these public and store them so far from there. In the very worst case, we could use a new private interface for that (compare #531 adding NewImageSourceWithOptions) but even that seems suboptimal, see elsewhere.

Comment thread image/copy/copy.go
if err != nil {
return nil, fmt.Errorf("initializing source %s: %w", transports.ImageName(srcRef), err)
}
rawSource := imagesource.FromPublic(publicRawSource)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that we convert from types.ImageSource to private.ImageSource at the earliest opportunity, and then we never need to worry about public-only implementations (so much).

Comment thread image/copy/copy.go Outdated
signersToClose []*signer.Signer // Signers that should be closed when this copier is destroyed.
}

type pinnedManifestSource struct {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessary in principle … dockerImageSource already has a cachedManifest (but it does require maintain that state carefully).

@mtrmac
Copy link
Copy Markdown
Contributor

mtrmac commented May 14, 2026

BTW the non-timeout test failures look real at a first glance.

When a blob fetch fails with a transient error (5xx, timeout) or
a blob-not-found (BLOB_UNKNOWN), try remaining configured mirrors
before giving up. The fallback is transparent to all callers
(CRI-O, Podman, Buildah, Skopeo).

The mirror probe is serialized under mirrorMu: the first goroutine
to hit a failure walks the mirror list while the rest block on
getActiveSource(), then reuse the discovered working mirror.

Signed-off-by: Qi Wang <qiwan@redhat.com>
@QiWang19 QiWang19 changed the title copy: Fall back to the next configured mirror on copy failure docker: add blob-level mirror fallback in dockerImageSource May 20, 2026
@QiWang19
Copy link
Copy Markdown
Member Author

@mtrmac I updated this PR blob-level mirror fallback. I'd appreciate an early review on the approach before I continue. And there are a few design questions I'd like your input on(in the PR description under "Questions.")

Copy link
Copy Markdown
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I’m afraid there is way too much going on (e.g. we need to migrate of Cirrus) and I’ll be only able to pay proper attention to this, probably, weeks later.

For now just a very minimal look: Yes, I’d expect the state to be in dockerImageSource ~only. I didn’t carefully read the logic.

Are the right errors triggering fallback in isMirrorTransientError() / isMirrorFallbackError()?

Looks plausible; I’m sure we will be turning the heuristics 10 years from now… Anyway, thanks for clearly splitting the logic into separate documented functions, that will help future maintenance.

Should any other error types be included or excluded?

I think bodyReader already handles another set of errors. [There is a question whether bodyReader should also trigger a fallback to a different mirror… I think that would better be a different PR, and it might not even be necessary — OTOH there it might be worth just thinking a bit about what the code structure to enable that would look like.]

The full-loop lock in getBlobWithMirrorFallback() serializes all blob goroutines during fallback — is this acceptable, or would you prefer a different concurrency strategy?

Without looking at the details, I think this is a great idea.

Should tryGetBlob() retry once on the same mirror before falling back, or should any failure immediately try the next source?

Hum… in general, it’s problematic to have retry loops in lower levels of code; if each level of a call stack has a retry loop, that makes retry behavior exponential in the depth of the call stack. So as a general principle, I’ve been told that retries should exist only at the top level if possible. (Which is why GetBlob/GetManifest has no retries itself; callers can use c/common/pkg/retry at a higher level. OTOH bodyReader exists because the retry behavior in that case is usefully different from retrying from scratch.)

So I think GetBlob should ideally not retry at all if no mirrors are involved. If we do have mirrors, I don’t have a strong opinion. Given how costly / disruptive / potentially unexpected / (likely to fail in many configuration?) a fallback to another mirror can be, I think one more retry before starting the mirror fallback can well make sense.

return s.getBlob(ctx, info, cache)
}

func (s *dockerImageSource) getBlob(ctx context.Context, info types.BlobInfo, cache types.BlobInfoCache) (io.ReadCloser, int64, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any other caller than GetBlob itself?

Comment on lines +610 to +611
// newImageSourceAttempt calls ensureManifestIsLoaded(), which preserves
// the manifest-level filtering: only mirrors that can serve the manifest
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not?! If we got go the getBlob stage, cachedManifest is already set and ensure… does nothing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

image Related to "image" package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants