fix: cap supervisor verify-fail loop at maxVerifyFails consecutive retries (PILOT-110)#4
Conversation
…tries (PILOT-110) superviseOne verify-fail path looped forever on persistent verifyBinary failure (corrupt/deleted/tampered binary). Added: - maxVerifyFails const (10 consecutive fails) - verifyFails tracking in supervise loop - markSuspended() helper for non-crash-loop suspension - On cap: write suspended marker + audit suspend event + return The verify-fail retry with backoff is preserved below the cap so transient issues (e.g. NFS lag, brief corruption) still get a chance to self-heal.
|
canary OK — smoke test passed on baseline cluster (run https://github.com/pilot-protocol/pilot-canary/actions/runs/26564364450) |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
PR Status —
|
| Check | Result |
|---|---|
test (ci) |
✅ SUCCESS |
security/snyk |
✅ SUCCESS |
codecov/patch |
❌ FAILURE (diff coverage below threshold; 1 file, +46/−7 lines) |
Canary: Not triggered for this branch (PR author opted out — change is supervisor-internal, no protocol change). Latest main-branch canary (run 26564364450, 08:42 UTC) — ✅ SUCCESS.
Linked Jira: PILOT-110 — "App-store supervisor: verify-fail backoff loop has no max iteration cap" — Status: QA/IN-REVIEW
Operator activity: None since PR open. This is matthew-pilot-authored, no operator comments/reviews yet.
What this PR changes —
|
What
Closes PILOT-110: The supervisor's
superviseOneverify-fail path looped forever on persistentverifyBinaryfailure (corrupt, deleted, or tampered binary). There was no upper bound — the exponential backoff'smaxBackoff=30scapped individual waits but the loop itself was unbounded.Fix
maxVerifyFails = 10constant — after 10 consecutive verify failures, the supervisor suspends the app (same suspension mechanism as crash-loop).verifyFailstracking in thesuperviseOneloop; counter resets on successful spawn.markSuspended()helper for non-crash-loop suspension paths..suspendedsentinel, logs audit suspend event, callsmarkSuspended, and returns.Scope
plugin/appstore/supervisor.goTesting
go test ./...)TestSuperviseOne_VerifyFail*tests pass (they exercise ctx-cancel path which is unchanged)Labels
canary: app-store canary = smoke only. Not triggering — the change is purely supervisor-internal (no daemon<>app protocol change, no new binaries).