fix: auto-heal corrupted OCI local store by forcing re-pull #1455
+131
−38
What I did
Related issue
Fixes #1448
What was the bug
When an OCI artifact was already present in the local content store, the system assumed it was valid and tried to load it unconditionally.
If the local store became corrupted or partially written (for example, interrupted downloads, invalid tar layers, or missing metadata), the following happened:
Loading from the local store failed with ErrStoreCorrupted.
remote.Pull was not sufficient, because the reference already existed locally.
In practice, a broken local cache could brick the OCI source resolution entirely.
Explain With Diagrams
OCI source resolution — old behavior (bug)
Description
In the previous implementation, OCI-based agent configurations were loaded from the local content store without any reliable recovery mechanism.
When an agent was requested, the OCI source attempted to read the artifact directly from the local content.Store. If any file inside the store was missing or inconsistent (for example, a missing reference file, tarball, or metadata), the store returned ErrStoreCorrupted. At this point, the error was treated as fatal.
What went wrong
Once ErrStoreCorrupted was returned, the error propagated as a fatal failure, and every later load hit the same damaged entry.
This created a persistent failure mode where a transient or partial disk issue resulted in a permanently broken agent cache. The system had no mechanism to invalidate or repair a broken local reference.
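The persistence of the failure can be illustrated with a minimal sketch. It does not reproduce the project's code: fakeStore, loadOld, and the locally defined ErrStoreCorrupted are hypothetical stand-ins, assumed only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStoreCorrupted stands in for the store's corruption error. It is
// redefined here (an assumption) so the sketch compiles on its own.
var ErrStoreCorrupted = errors.New("store corrupted")

// fakeStore is a hypothetical in-memory stand-in for the local content store.
type fakeStore struct {
	corrupted map[string]bool // references whose local entry is damaged
}

// Load returns ErrStoreCorrupted for any reference marked as damaged.
func (s *fakeStore) Load(ref string) (string, error) {
	if s.corrupted[ref] {
		return "", fmt.Errorf("load %s: %w", ref, ErrStoreCorrupted)
	}
	return "config for " + ref, nil
}

// loadOld mirrors the old flow as described: the local store is trusted,
// and any error, including corruption, is handed back to the caller as fatal.
func loadOld(s *fakeStore, ref string) (string, error) {
	return s.Load(ref)
}

func main() {
	s := &fakeStore{corrupted: map[string]bool{"agent:latest": true}}
	// Nothing repairs the store, so every attempt fails the same way.
	for i := 1; i <= 3; i++ {
		_, err := loadOld(s, "agent:latest")
		fmt.Printf("attempt %d: corrupted=%v\n", i, errors.Is(err, ErrStoreCorrupted))
	}
}
```

Because nothing between attempts touches the damaged entry, retrying alone can never succeed in this model.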
Impact
From the user’s perspective, this surfaced as intermittent but unrecoverable errors such as:
“<name>:latest not found”
Even though the remote artifact was valid and accessible, the local corruption prevented recovery.
New behavior (self-healing OCI store flow)
Description
With the new implementation, the local OCI content store is treated as a recoverable cache rather than a source of truth.
When an agent is requested from an OCI reference, the system follows a multi-step fallback strategy that guarantees recovery from partial or inconsistent local state.
Step-by-step flow
Local load attempt
Normal OCI pull (safe revalidation)
Corruption detection
If the load returns ErrStoreCorrupted, the store is considered inconsistent.
Forced re-pull (store repair)
Final retry
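The flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the cache type, its Load and Pull methods, the force flag, and LoadSelfHealing are hypothetical names, and the sketch collapses the steps into pull, load, detect, forced re-pull, and a single retry, which may not match the exact ordering in the implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStoreCorrupted stands in for content.ErrStoreCorrupted. It is redefined
// here (an assumption) only so the sketch compiles on its own.
var ErrStoreCorrupted = errors.New("store corrupted")

// cache is a hypothetical in-memory stand-in for the local content store
// together with the remote registry behind it.
type cache struct {
	local  map[string]string // ref -> stored payload; "" marks a damaged entry
	remote map[string]string // ref -> payload available remotely
}

// Load reads a reference from the local store.
func (c *cache) Load(ref string) (string, error) {
	v, ok := c.local[ref]
	if !ok {
		return "", fmt.Errorf("%s: not found locally", ref)
	}
	if v == "" {
		return "", fmt.Errorf("%s: %w", ref, ErrStoreCorrupted)
	}
	return v, nil
}

// Pull copies a reference from the remote. Without force it skips references
// that already exist locally, so it cannot repair a damaged entry.
func (c *cache) Pull(ref string, force bool) error {
	if _, exists := c.local[ref]; exists && !force {
		return nil
	}
	v, ok := c.remote[ref]
	if !ok {
		return fmt.Errorf("%s: remote pull failed", ref)
	}
	c.local[ref] = v
	return nil
}

// LoadSelfHealing pulls, loads, and on corruption forces one re-pull
// followed by exactly one more load attempt.
func LoadSelfHealing(c *cache, ref string) (string, error) {
	pullErr := c.Pull(ref, false) // normal pull; a failure is tolerated for now
	v, err := c.Load(ref)         // local load attempt
	if err == nil {
		return v, nil
	}
	if !errors.Is(err, ErrStoreCorrupted) {
		// Not a corruption error, so a forced re-pull is not attempted.
		if pullErr != nil {
			return "", fmt.Errorf("pull %s: %w", ref, pullErr)
		}
		return "", err
	}
	if err := c.Pull(ref, true); err != nil { // forced re-pull (store repair)
		return "", fmt.Errorf("repair %s: %w", ref, err)
	}
	return c.Load(ref) // final retry; no further attempts
}

func main() {
	c := &cache{
		local:  map[string]string{"agent:latest": ""}, // damaged entry
		remote: map[string]string{"agent:latest": "agent config"},
	}
	v, err := LoadSelfHealing(c, "agent:latest")
	fmt.Println(v, err)
}
```

The forced pull is what distinguishes this from the old behavior: in the sketch, a normal pull is a no-op whenever the reference already exists locally, so only the forced variant can overwrite a damaged entry.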
Key guarantees
Outcome
This change converts a hard failure scenario into a self-healing process, ensuring that agent execution remains reliable even in the presence of local cache corruption.
Failure scenarios and recovery boundaries
Description
This diagram highlights the different failure scenarios that can occur when loading agents from OCI artifacts and clearly defines where recovery is possible and where it must stop.
The goal is to avoid infinite retries while still guaranteeing automatic repair whenever feasible.
Failure scenarios covered
Missing reference link
The reference entry under store/refs/ does not exist.
Missing or unreadable tarball
The <digest>.tar file is missing or cannot be parsed.
Invalid image structure
Empty or unreadable layers
Remote pull failure
Recovery boundaries
If local corruption is detected:
If remote pull fails and no valid local copy exists:
If remote pull fails but a valid local copy exists:
Safety guarantees
Outcome
This model ensures that the system aggressively heals local state when possible, while still failing fast and transparently when recovery is genuinely impossible.
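The recovery boundaries above can be read as a small decision table. The sketch below is one possible encoding, assuming that a valid local copy is used when the remote is unreachable and that the forced re-pull is attempted at most once; the Action type, Decide, and errRemote are hypothetical names that do not come from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStoreCorrupted stands in for content.ErrStoreCorrupted (redefined here
// as an assumption so the sketch is self-contained).
var ErrStoreCorrupted = errors.New("store corrupted")

// errRemote is a hypothetical placeholder for any remote pull failure.
var errRemote = errors.New("remote unavailable")

// Action names the outcome chosen for a combination of errors.
type Action string

const (
	UseLocal    Action = "use local copy"
	ForceRepull Action = "force re-pull, then retry once"
	Fail        Action = "fail with error"
)

// Decide maps the local load result, the remote pull result, and whether a
// forced re-pull was already attempted to the next action.
func Decide(localErr, pullErr error, alreadyForced bool) Action {
	switch {
	case localErr == nil:
		// A valid local copy exists, so a remote failure is not fatal.
		return UseLocal
	case pullErr != nil:
		// No valid local copy and the remote is unreachable: nothing to heal from.
		return Fail
	case errors.Is(localErr, ErrStoreCorrupted) && !alreadyForced:
		// Local corruption with a reachable remote: repair the store once.
		return ForceRepull
	default:
		// Either not a corruption error or the repair was already tried;
		// stopping here keeps the number of retries bounded.
		return Fail
	}
}

func main() {
	fmt.Println(Decide(nil, errRemote, false))
	fmt.Println(Decide(ErrStoreCorrupted, nil, false))
	fmt.Println(Decide(ErrStoreCorrupted, nil, true))
	fmt.Println(Decide(ErrStoreCorrupted, errRemote, false))
}
```

Passing alreadyForced explicitly is a design choice of the sketch: it makes the "at most one repair" rule visible in the signature instead of hiding it in a loop counter.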
Notes
Corruption is identified by content.ErrStoreCorrupted.
Important
No automated tests were added as part of this change.
The fix was validated through manual builds and targeted reasoning over failure scenarios.
OS / System