fix(mount): retry transient cursor resolution instead of full export (#1499)#221
Conversation
…(#1499) On a reused/churned sticky box the restart fast-path resolved the events cursor under a single 20s deadline. Under cloud API load that resolution hit `context deadline exceeded`, and the daemon fell through to a full `/fs/export` bootstrap. The full export then exceeded the 90s bootstrap watchdog (and the cloud per-fire flushTimeoutSeconds:20 → exit 124), leaving the state dir non-empty without a completed bootstrap — so the next box reuse repeated the same expensive path. That is the #1499 warm-degrade stall loop. Fix the cursor fast-path to be resumable + tolerant of transient slow-API windows so box reuse re-syncs incrementally rather than full-exporting: - resolveLatestEventCursor now retries timeout-class failures (context.DeadlineExceeded / net.Error timeout) with exponential backoff (resolveLatestEventCursorOnce + isCursorResolutionRetryable + cursorResolutionRetryDelay), each attempt getting its own rootCtx-derived deadline. - Raise the default cursor-resolution deadline 20s → 60s (defaultCursorTimeout + both CLI/daemon --cursor-timeout flags). - When cursor resolution still fails with a retryable timeout, the restart fast-path returns the cycle error instead of collapsing reusable state (BootstrapComplete + tracked files) into a full export. The next reconcile retries the cheap cursor path, breaking the full-export loop; on-disk mirror state is preserved. Tests (internal/mountsync/syncer_test.go): - TestPullReusedMountWithPersistedCursorUsesIncrementalOnly: a reused mount with a persisted EventsCursor takes the incremental path only — no full tree pull, no cursor re-resolution. - TestPullRestartFastPathRetriesCursorDeadlineBeforeFullPull: a transient cursor deadline on restart retries and seeds the cursor incrementally — no full pull. - TestPullRestartFastPathTimeoutDoesNotFallBackToFullPull: a persistent cursor deadline exhausts retries and fails the cycle without a full pull. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThe PR increases the ChangesCursor Timeout Increase and Retry Logic
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Relayfile Eval ReviewRun: Passed: 4 | Needs human: 0 | Reviewable: 0 | Missing output: 0 | Failed: 0 | Skipped: 0 Human Review CasesNo reviewable human-review cases captured Relayfile output. |
There was a problem hiding this comment.
Code Review
This pull request increases the default cursor timeout from 20 to 60 seconds and introduces a retry mechanism with exponential backoff for transient cursor resolution failures. It also prevents falling back to an expensive full pull when a cursor resolution fails due to a retryable error (such as a timeout). The review comments identify two important issues: first, if the caller's context is cancelled, the syncer could incorrectly fall back to a heavy full pull instead of aborting immediately; second, the retry loop does not immediately abort when the caller's context is cancelled, leading to unnecessary delays and log spam. Both comments provide actionable code suggestions to check ctx.Err() to resolve these issues.
| if errors.As(err, &httpErr) && httpErr.StatusCode == http.StatusNotFound { | ||
| // No events feed on this backend — fall through to the | ||
| // full-pull bootstrap path. (Pre-fix behaviour.) | ||
| } else if isCursorResolutionRetryable(err) { |
There was a problem hiding this comment.
If the caller's context ctx is cancelled (e.g., context.Canceled), isCursorResolutionRetryable(err) returns false. This causes the syncer to fall through to the else block and trigger a heavy full pull/bootstrap in the background (since bootstrapContext derives its context from s.rootCtx rather than ctx). Checking ctx.Err() != nil ensures we abort immediately if the cycle's context is no longer active, preventing unnecessary resource consumption and potential server overload.
| } else if isCursorResolutionRetryable(err) { | |
| } else if ctx.Err() != nil || isCursorResolutionRetryable(err) { |
| return cursor, nil | ||
| } | ||
| lastErr = err | ||
| if !isCursorResolutionRetryable(err) || attempt == attempts-1 { |
There was a problem hiding this comment.
If the caller's context ctx is cancelled or times out during an attempt, resolveLatestEventCursorOnce will return context.Canceled or context.DeadlineExceeded. Since isCursorResolutionRetryable(context.DeadlineExceeded) is true, the retry loop will continue to wait for the retry delay and attempt the next iteration, even though any subsequent attempt is guaranteed to fail immediately. Checking ctx.Err() != nil allows us to abort the retry loop immediately, avoiding useless log spam and unnecessary sleep delays.
| if !isCursorResolutionRetryable(err) || attempt == attempts-1 { | |
| if ctx.Err() != nil || !isCursorResolutionRetryable(err) || attempt == attempts-1 { |
There was a problem hiding this comment.
✅ No new issues found.
Reviewed changes — Fixes a production stall loop where transient cursor-resolution timeout on reused sticky boxes collapsed into a full-tree export, which then exceeded the bootstrap watchdog and flush timeout, leaving state non-empty without completed bootstrap.
- Retry cursor resolution with exponential backoff (
internal/mountsync/syncer.go) —resolveLatestEventCursornow delegates toresolveLatestEventCursorOncein a retry loop (3 attempts, 250ms base delay, 2× per attempt, capped at min(2s, timeout/10)). Each attempt gets its ownrootCtx-derived deadline. - Classify timeout errors as retryable —
isCursorResolutionRetryableidentifiescontext.DeadlineExceededandnet.Errortimeouts; non-retryable errors (404 missing events feed, etc.) preserve the pre-existing fall-through to full export. - Break the full-export loop — In
pullRemote, when cursor resolution fails with a retryable error even after exhausting retries, the error is surfaced instead of falling through to full-pull bootstrap. On-disk mirror state is preserved; the next reconcile retries the cheap cursor path. - Raise cursor timeout 20s → 60s (
cmd/relayfile-cli/main.go,cmd/relayfile-mount/main.go,internal/mountsync/syncer.go) — matches the internal default and both CLI flag defaults. - Tests (
internal/mountsync/syncer_test.go) — Three new tests cover: (1) persisted cursor reuse skips full pull and cursor resolution entirely, (2) transient deadline triggers one retry then succeeds, (3) persistent deadline exhausts retries and returns the error without falling through to full pull.
There was a problem hiding this comment.
2 issues found across 4 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="internal/mountsync/syncer.go">
<violation number="1" location="internal/mountsync/syncer.go:2466">
P2: If the caller's `ctx` is cancelled, the resulting `context.Canceled` error is not matched by `isCursorResolutionRetryable`, causing the code to fall through to the `else` branch and trigger an expensive full pull/bootstrap. Add a `ctx.Err() != nil` guard before the retryable check to return early when the cycle context is already dead, avoiding unnecessary resource consumption.</violation>
<violation number="2" location="internal/mountsync/syncer.go:4040">
P2: If the caller's context `ctx` is cancelled during an attempt, the error from `resolveLatestEventCursorOnce` may still be `context.DeadlineExceeded` (from the `s.rootCtx`-derived timeout), which passes `isCursorResolutionRetryable`. The loop then sleeps via `waitWithContext(s.rootCtx, delay)` which doesn't observe `ctx`, causing unnecessary delays. Check `ctx.Err() != nil` to bail out immediately when the caller's context is done.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| return cursor, nil | ||
| } | ||
| lastErr = err | ||
| if !isCursorResolutionRetryable(err) || attempt == attempts-1 { |
There was a problem hiding this comment.
P2: If the caller's context ctx is cancelled during an attempt, the error from resolveLatestEventCursorOnce may still be context.DeadlineExceeded (from the s.rootCtx-derived timeout), which passes isCursorResolutionRetryable. The loop then sleeps via waitWithContext(s.rootCtx, delay) which doesn't observe ctx, causing unnecessary delays. Check ctx.Err() != nil to bail out immediately when the caller's context is done.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At internal/mountsync/syncer.go, line 4040:
<comment>If the caller's context `ctx` is cancelled during an attempt, the error from `resolveLatestEventCursorOnce` may still be `context.DeadlineExceeded` (from the `s.rootCtx`-derived timeout), which passes `isCursorResolutionRetryable`. The loop then sleeps via `waitWithContext(s.rootCtx, delay)` which doesn't observe `ctx`, causing unnecessary delays. Check `ctx.Err() != nil` to bail out immediately when the caller's context is done.</comment>
<file context>
@@ -4013,12 +4026,42 @@ func (s *Syncer) markIncrementalCheckpoint(pageStartCursor, pageCursor, phase, r
+ return cursor, nil
+ }
+ lastErr = err
+ if !isCursorResolutionRetryable(err) || attempt == attempts-1 {
+ return "", err
+ }
</file context>
| if !isCursorResolutionRetryable(err) || attempt == attempts-1 { | |
| if ctx.Err() != nil || !isCursorResolutionRetryable(err) || attempt == attempts-1 { |
| if errors.As(err, &httpErr) && httpErr.StatusCode == http.StatusNotFound { | ||
| // No events feed on this backend — fall through to the | ||
| // full-pull bootstrap path. (Pre-fix behaviour.) | ||
| } else if isCursorResolutionRetryable(err) { |
There was a problem hiding this comment.
P2: If the caller's ctx is cancelled, the resulting context.Canceled error is not matched by isCursorResolutionRetryable, causing the code to fall through to the else branch and trigger an expensive full pull/bootstrap. Add a ctx.Err() != nil guard before the retryable check to return early when the cycle context is already dead, avoiding unnecessary resource consumption.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At internal/mountsync/syncer.go, line 2466:
<comment>If the caller's `ctx` is cancelled, the resulting `context.Canceled` error is not matched by `isCursorResolutionRetryable`, causing the code to fall through to the `else` branch and trigger an expensive full pull/bootstrap. Add a `ctx.Err() != nil` guard before the retryable check to return early when the cycle context is already dead, avoiding unnecessary resource consumption.</comment>
<file context>
@@ -2459,6 +2463,15 @@ func (s *Syncer) pullRemote(ctx context.Context, conflicted map[string]struct{})
if errors.As(err, &httpErr) && httpErr.StatusCode == http.StatusNotFound {
// No events feed on this backend — fall through to the
// full-pull bootstrap path. (Pre-fix behaviour.)
+ } else if isCursorResolutionRetryable(err) {
+ // A completed prior bootstrap with tracked files is reusable
+ // state. Under load, falling through from a transient cursor
</file context>
| } else if isCursorResolutionRetryable(err) { | |
| } else if ctx.Err() != nil || isCursorResolutionRetryable(err) { |

Problem (#1499 relayfile warm-degrade stall loop)
On a reused/churned sticky box the restart fast-path resolved the events cursor under a single 20s deadline (
resolveLatestEventCursor,internal/mountsync/syncer.go). Under cloud/fs/exportAPI load that resolution hitcontext deadline exceeded, so the daemon fell through to a full-tree bootstrap export. The full export then:flushTimeoutSeconds:20→ exit 124,leaving the state dir non-empty without a completed bootstrap. The next box reuse repeated the same expensive path — the sticky bad-state loop. It only escaped when a clear API window happened to let one full pull complete.
Root cause (file:line)
internal/mountsync/syncer.goresolveLatestEventCursor— one attempt, one fixedcursorTimeout(20s) deadline; any timeout returned an error.internal/mountsync/syncer.gorestart fast-path inpullRemote(cursor-resolution-failed branch) — on any error it fell through to the full-pull/export bootstrap, including a transient timeout against already-reusable state (BootstrapComplete+ tracked files).Fix
Make the cursor fast-path tolerant of transient slow-API windows so box reuse re-syncs incrementally instead of full-exporting:
resolveLatestEventCursornow retries timeout-class failures (context.DeadlineExceeded/net.Errortimeout) with exponential backoff —resolveLatestEventCursorOnce+isCursorResolutionRetryable+cursorResolutionRetryDelay. Each attempt gets its ownrootCtx-derived deadline.defaultCursorTimeout+ both--cursor-timeoutflags incmd/relayfile-mountandcmd/relayfile-cli).Non-timeout /
404(no events feed) errors keep the prior fall-through-to-full-pull behavior — this change is scoped to the transient timeout class.Tests (
internal/mountsync/syncer_test.go) — all GREENTestPullReusedMountWithPersistedCursorUsesIncrementalOnly— a reused mount with a persistedEventsCursortakes the incremental path only: no full tree pull, no cursor re-resolution.TestPullRestartFastPathRetriesCursorDeadlineBeforeFullPull— a transient cursor deadline on restart retries and seeds the cursor incrementally — no full pull.TestPullRestartFastPathTimeoutDoesNotFallBackToFullPull— a persistent cursor deadline exhausts retries and fails the cycle without a full pull.Full
go test ./internal/mountsync/passes (~103s);go vet+gofmtclean.Rollout
This is a change to the Go
relayfile-mountdaemon binary, distributed via GitHub releases (scripts/install.sh→releases/download/${RELAYFILE_VERSION:-latest}). After merge it needs a relayfile release/version bump (e.g.v0.8.4→v0.8.5) so cloud-warmed sandboxes install the fixed daemon. Cloud must pin the newRELAYFILE_VERSION(or take latest) where it installsrelayfile-mount.🤖 Generated with Claude Code