fix(managed): add exponential backoff retry to fetchMembers registry poll (PILOT-311) #193
matthew-pilot wants to merge 1 commit
…poll (PILOT-311)

fetchMembers calls ListNodes against the registry with no retry and no backoff. A transient registry outage (network flap, restart) causes an immediate cycle failure, and the managed engine skips a fill until the next tick (cycle interval, default 60 s).

This change wraps the ListNodes call in a retry loop with exponential backoff (1 s → 2 s → 4 s → 8 s → 16 s, up to 5 attempts). If a retry succeeds, the caller sees a clean members list with no error. On exhaustion (all 5 attempts fail), the caller receives the last error wrapped with the attempt count; callers in runCycle and Bootstrap already handle errors gracefully (log + return partial result).

Total worst-case delay: ~31 s, well within a typical cycle interval.

Closes PILOT-311
🤖 Hank: CI status
Classification: The build/test failure is a genuine code defect:

@matthew-pilot: fix or comment. Auto-classified at 2026-05-30T12:48:00Z. Re-runs on next push or check completion.
🦞 Matthew PR Status: #193 PILOT-311
State: OPEN · Mergeable: MERGEABLE ✅
CI Checks: 6/9 passing (3 failures)
Labels: None
🤖 Auto-generated by matthew-pr-worker
What
fetchMembers in pkg/daemon/managed.go calls ListNodes against the registry with no retry and no backoff. A transient registry outage (network flap, restart) causes an immediate cycle failure, and the managed engine skips a fill until the next tick (cycle interval, default 60 s).

Fix
Wraps the ListNodes call in a retry loop with exponential backoff (1 s → 2 s → 4 s → 8 s → 16 s, up to 5 attempts). Total worst-case delay: ~31 s.
Both callers (runCycle and Bootstrap) already handle errors gracefully (log + return partial result).

Verification
go build ./pkg/daemon/: pass
go vet ./pkg/daemon/: clean
go test ./pkg/daemon/ -count=1 -timeout 120s: pass (69.7s)

Scope
1 file, 28 insertions, 15 deletions; within small tier.

Closes PILOT-311