Skip to content

fix(mount): retry transient cursor resolution instead of full export (#1499)#221

Merged
khaliqgant merged 1 commit into
mainfrom
fix/daemon-cursor-fast-path
May 30, 2026
Merged

fix(mount): retry transient cursor resolution instead of full export (#1499)#221
khaliqgant merged 1 commit into
mainfrom
fix/daemon-cursor-fast-path

Conversation

@khaliqgant
Copy link
Copy Markdown
Member

Problem (#1499 relayfile warm-degrade stall loop)

On a reused/churned sticky box the restart fast-path resolved the events cursor under a single 20s deadline (resolveLatestEventCursor, internal/mountsync/syncer.go). Under cloud /fs/export API load that resolution hit context deadline exceeded, so the daemon fell through to a full-tree bootstrap export. The full export then:

  • exceeded the 90s bootstrap watchdog, and
  • exceeded the cloud per-fire flushTimeoutSeconds:20exit 124,

leaving the state dir non-empty without a completed bootstrap. The next box reuse repeated the same expensive path — the sticky bad-state loop. It only escaped when a clear API window happened to let one full pull complete.

Root cause (file:line)

  • internal/mountsync/syncer.go resolveLatestEventCursor — one attempt, one fixed cursorTimeout (20s) deadline; any timeout returned an error.
  • internal/mountsync/syncer.go restart fast-path in pullRemote (cursor-resolution-failed branch) — on any error it fell through to the full-pull/export bootstrap, including a transient timeout against already-reusable state (BootstrapComplete + tracked files).

Fix

Make the cursor fast-path tolerant of transient slow-API windows so box reuse re-syncs incrementally instead of full-exporting:

  1. Retry with backoff. resolveLatestEventCursor now retries timeout-class failures (context.DeadlineExceeded / net.Error timeout) with exponential backoff — resolveLatestEventCursorOnce + isCursorResolutionRetryable + cursorResolutionRetryDelay. Each attempt gets its own rootCtx-derived deadline.
  2. Raise the deadline 20s → 60s (defaultCursorTimeout + both --cursor-timeout flags in cmd/relayfile-mount and cmd/relayfile-cli).
  3. Don't collapse reusable state into a full export. When cursor resolution still fails with a retryable timeout, the restart fast-path returns the cycle error instead of running a full export. The next reconcile retries the cheap cursor path; on-disk mirror state is preserved. This breaks the full-export loop.

Non-timeout / 404 (no events feed) errors keep the prior fall-through-to-full-pull behavior — this change is scoped to the transient timeout class.

Tests (internal/mountsync/syncer_test.go) — all GREEN

  • TestPullReusedMountWithPersistedCursorUsesIncrementalOnly — a reused mount with a persisted EventsCursor takes the incremental path only: no full tree pull, no cursor re-resolution.
  • TestPullRestartFastPathRetriesCursorDeadlineBeforeFullPull — a transient cursor deadline on restart retries and seeds the cursor incrementally — no full pull.
  • TestPullRestartFastPathTimeoutDoesNotFallBackToFullPull — a persistent cursor deadline exhausts retries and fails the cycle without a full pull.

Full go test ./internal/mountsync/ passes (~103s); go vet + gofmt clean.

Rollout

This is a change to the Go relayfile-mount daemon binary, distributed via GitHub releases (scripts/install.shreleases/download/${RELAYFILE_VERSION:-latest}). After merge it needs a relayfile release/version bump (e.g. v0.8.4v0.8.5) so cloud-warmed sandboxes install the fixed daemon. Cloud must pin the new RELAYFILE_VERSION (or take latest) where it installs relayfile-mount.

🤖 Generated with Claude Code

…(#1499)

On a reused/churned sticky box the restart fast-path resolved the events
cursor under a single 20s deadline. Under cloud API load that resolution
hit `context deadline exceeded`, and the daemon fell through to a full
`/fs/export` bootstrap. The full export then exceeded the 90s bootstrap
watchdog (and the cloud per-fire flushTimeoutSeconds:20 → exit 124),
leaving the state dir non-empty without a completed bootstrap — so the
next box reuse repeated the same expensive path. That is the #1499
warm-degrade stall loop.

Fix the cursor fast-path to be resumable + tolerant of transient slow-API
windows so box reuse re-syncs incrementally rather than full-exporting:

- resolveLatestEventCursor now retries timeout-class failures
  (context.DeadlineExceeded / net.Error timeout) with exponential backoff
  (resolveLatestEventCursorOnce + isCursorResolutionRetryable +
  cursorResolutionRetryDelay), each attempt getting its own rootCtx-derived
  deadline.
- Raise the default cursor-resolution deadline 20s → 60s
  (defaultCursorTimeout + both CLI/daemon --cursor-timeout flags).
- When cursor resolution still fails with a retryable timeout, the restart
  fast-path returns the cycle error instead of collapsing reusable state
  (BootstrapComplete + tracked files) into a full export. The next
  reconcile retries the cheap cursor path, breaking the full-export loop;
  on-disk mirror state is preserved.

Tests (internal/mountsync/syncer_test.go):
- TestPullReusedMountWithPersistedCursorUsesIncrementalOnly: a reused mount
  with a persisted EventsCursor takes the incremental path only — no full
  tree pull, no cursor re-resolution.
- TestPullRestartFastPathRetriesCursorDeadlineBeforeFullPull: a transient
  cursor deadline on restart retries and seeds the cursor incrementally —
  no full pull.
- TestPullRestartFastPathTimeoutDoesNotFallBackToFullPull: a persistent
  cursor deadline exhausts retries and fails the cycle without a full pull.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 30, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 6b7f7235-0392-41b4-8130-fb38e9987405

📥 Commits

Reviewing files that changed from the base of the PR and between 7b3af75 and 67cc17f.

📒 Files selected for processing (4)
  • cmd/relayfile-cli/main.go
  • cmd/relayfile-mount/main.go
  • internal/mountsync/syncer.go
  • internal/mountsync/syncer_test.go

📝 Walkthrough

Walkthrough

The PR increases the --cursor-timeout CLI default from 20s to 60s, adds retry logic with exponential backoff to cursor resolution, and prevents restart fast-path from falling back to full bootstrap when cursor resolution fails with retryable errors. Changes span CLI configuration, core syncer logic, and new test coverage.

Changes

Cursor Timeout Increase and Retry Logic

Layer / File(s) Summary
CLI flag defaults updated
cmd/relayfile-cli/main.go, cmd/relayfile-mount/main.go
The --cursor-timeout flag default is increased from 20s to 60s in both CLI tools, and help text is updated to reflect the new default.
Syncer cursor-resolution configuration
internal/mountsync/syncer.go
Default cursor timeout constant is increased to 60s, new retry attempt and backoff constants are introduced, and CursorTimeout field documentation is refreshed to document the new default and environment variable fallback behavior.
Cursor resolution retry wrapper and helpers
internal/mountsync/syncer.go
resolveLatestEventCursor is refactored into a retry wrapper that calls a per-attempt implementation with deadline derived from root context, retrying on deadline/timeout errors with exponential backoff; isCursorResolutionRetryable and cursorResolutionRetryDelay helpers classify retryable failures and compute backoff delays.
Restart fast-path error handling
internal/mountsync/syncer.go
pullRemote restart fast-path now explicitly returns retryable cursor-resolution errors instead of falling through to full bootstrap, allowing the daemon to retry cursor seeding on the next reconcile cycle.
Test infrastructure and cursor behavior coverage
internal/mountsync/syncer_test.go
Test fake client gains latestEventIDHook callback to control cursor-resolution outcomes per attempt; three new tests verify that persisted cursors skip bootstrap, restart fast-path retries before fallback, and persistent timeouts prevent fallback to full pull.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • AgentWorkforce/relayfile#166: Modifies mountsync event-cursor resolution and CursorTimeout wiring used by restart fast-path, overlapping with this PR's cursor-timeout default and retry changes.
  • AgentWorkforce/relayfile#196: Changes resolveLatestEventCursor behavior in syncer.go to prefer LatestEventID with paginated fallback, directly related to this PR's retry/timeout updates.
  • AgentWorkforce/relayfile#92: Modifies restart/fast-path logic around resolving and seeding EventsCursor from events feed, including retry vs fallback decisions.

Poem

🐰 A whisker-twitching ode to timeouts and tries:
Twenty seconds was brief, but now we're wise—
Sixty seconds of patience, with retries that flow,
Exponential backoff makes restarts go slow.
Fast-path returns errors, no bootstrap fallback,
One simple tweak keeps the relay on track!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: fixing a daemon issue by implementing retry logic for transient cursor resolution failures instead of falling back to full exports.
Description check ✅ Passed The description thoroughly explains the problem, root cause, fix, tests, and rollout requirements, all directly related to the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/daemon-cursor-fast-path

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

Relayfile Eval Review

Run: .relayfile/evals/runs/2026-05-30T07-12-07-547Z-HEAD-provider
Mode: provider
Git SHA: d3694f2

Passed: 4 | Needs human: 0 | Reviewable: 0 | Missing output: 0 | Failed: 0 | Skipped: 0

Human Review Cases

No reviewable human-review cases captured Relayfile output.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request increases the default cursor timeout from 20 to 60 seconds and introduces a retry mechanism with exponential backoff for transient cursor resolution failures. It also prevents falling back to an expensive full pull when a cursor resolution fails due to a retryable error (such as a timeout). The review comments identify two important issues: first, if the caller's context is cancelled, the syncer could incorrectly fall back to a heavy full pull instead of aborting immediately; second, the retry loop does not immediately abort when the caller's context is cancelled, leading to unnecessary delays and log spam. Both comments provide actionable code suggestions to check ctx.Err() to resolve these issues.

if errors.As(err, &httpErr) && httpErr.StatusCode == http.StatusNotFound {
// No events feed on this backend — fall through to the
// full-pull bootstrap path. (Pre-fix behaviour.)
} else if isCursorResolutionRetryable(err) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If the caller's context ctx is cancelled (e.g., context.Canceled), isCursorResolutionRetryable(err) returns false. This causes the syncer to fall through to the else block and trigger a heavy full pull/bootstrap in the background (since bootstrapContext derives its context from s.rootCtx rather than ctx). Checking ctx.Err() != nil ensures we abort immediately if the cycle's context is no longer active, preventing unnecessary resource consumption and potential server overload.

Suggested change
} else if isCursorResolutionRetryable(err) {
} else if ctx.Err() != nil || isCursorResolutionRetryable(err) {

return cursor, nil
}
lastErr = err
if !isCursorResolutionRetryable(err) || attempt == attempts-1 {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the caller's context ctx is cancelled or times out during an attempt, resolveLatestEventCursorOnce will return context.Canceled or context.DeadlineExceeded. Since isCursorResolutionRetryable(context.DeadlineExceeded) is true, the retry loop will continue to wait for the retry delay and attempt the next iteration, even though any subsequent attempt is guaranteed to fail immediately. Checking ctx.Err() != nil allows us to abort the retry loop immediately, avoiding useless log spam and unnecessary sleep delays.

Suggested change
if !isCursorResolutionRetryable(err) || attempt == attempts-1 {
if ctx.Err() != nil || !isCursorResolutionRetryable(err) || attempt == attempts-1 {

Copy link
Copy Markdown

@pullfrog pullfrog Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ No new issues found.

Reviewed changes — Fixes a production stall loop where transient cursor-resolution timeout on reused sticky boxes collapsed into a full-tree export, which then exceeded the bootstrap watchdog and flush timeout, leaving state non-empty without completed bootstrap.

  • Retry cursor resolution with exponential backoff (internal/mountsync/syncer.go) — resolveLatestEventCursor now delegates to resolveLatestEventCursorOnce in a retry loop (3 attempts, 250ms base delay, 2× per attempt, capped at min(2s, timeout/10)). Each attempt gets its own rootCtx-derived deadline.
  • Classify timeout errors as retryableisCursorResolutionRetryable identifies context.DeadlineExceeded and net.Error timeouts; non-retryable errors (404 missing events feed, etc.) preserve the pre-existing fall-through to full export.
  • Break the full-export loop — In pullRemote, when cursor resolution fails with a retryable error even after exhausting retries, the error is surfaced instead of falling through to full-pull bootstrap. On-disk mirror state is preserved; the next reconcile retries the cheap cursor path.
  • Raise cursor timeout 20s → 60s (cmd/relayfile-cli/main.go, cmd/relayfile-mount/main.go, internal/mountsync/syncer.go) — matches the internal default and both CLI flag defaults.
  • Tests (internal/mountsync/syncer_test.go) — Three new tests cover: (1) persisted cursor reuse skips full pull and cursor resolution entirely, (2) transient deadline triggers one retry then succeeds, (3) persistent deadline exhausts retries and returns the error without falling through to full pull.

Pullfrog  | View workflow run𝕏

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 4 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="internal/mountsync/syncer.go">

<violation number="1" location="internal/mountsync/syncer.go:2466">
P2: If the caller's `ctx` is cancelled, the resulting `context.Canceled` error is not matched by `isCursorResolutionRetryable`, causing the code to fall through to the `else` branch and trigger an expensive full pull/bootstrap. Add a `ctx.Err() != nil` guard before the retryable check to return early when the cycle context is already dead, avoiding unnecessary resource consumption.</violation>

<violation number="2" location="internal/mountsync/syncer.go:4040">
P2: If the caller's context `ctx` is cancelled during an attempt, the error from `resolveLatestEventCursorOnce` may still be `context.DeadlineExceeded` (from the `s.rootCtx`-derived timeout), which passes `isCursorResolutionRetryable`. The loop then sleeps via `waitWithContext(s.rootCtx, delay)` which doesn't observe `ctx`, causing unnecessary delays. Check `ctx.Err() != nil` to bail out immediately when the caller's context is done.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

return cursor, nil
}
lastErr = err
if !isCursorResolutionRetryable(err) || attempt == attempts-1 {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: If the caller's context ctx is cancelled during an attempt, the error from resolveLatestEventCursorOnce may still be context.DeadlineExceeded (from the s.rootCtx-derived timeout), which passes isCursorResolutionRetryable. The loop then sleeps via waitWithContext(s.rootCtx, delay) which doesn't observe ctx, causing unnecessary delays. Check ctx.Err() != nil to bail out immediately when the caller's context is done.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At internal/mountsync/syncer.go, line 4040:

<comment>If the caller's context `ctx` is cancelled during an attempt, the error from `resolveLatestEventCursorOnce` may still be `context.DeadlineExceeded` (from the `s.rootCtx`-derived timeout), which passes `isCursorResolutionRetryable`. The loop then sleeps via `waitWithContext(s.rootCtx, delay)` which doesn't observe `ctx`, causing unnecessary delays. Check `ctx.Err() != nil` to bail out immediately when the caller's context is done.</comment>

<file context>
@@ -4013,12 +4026,42 @@ func (s *Syncer) markIncrementalCheckpoint(pageStartCursor, pageCursor, phase, r
+			return cursor, nil
+		}
+		lastErr = err
+		if !isCursorResolutionRetryable(err) || attempt == attempts-1 {
+			return "", err
+		}
</file context>
Suggested change
if !isCursorResolutionRetryable(err) || attempt == attempts-1 {
if ctx.Err() != nil || !isCursorResolutionRetryable(err) || attempt == attempts-1 {

if errors.As(err, &httpErr) && httpErr.StatusCode == http.StatusNotFound {
// No events feed on this backend — fall through to the
// full-pull bootstrap path. (Pre-fix behaviour.)
} else if isCursorResolutionRetryable(err) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: If the caller's ctx is cancelled, the resulting context.Canceled error is not matched by isCursorResolutionRetryable, causing the code to fall through to the else branch and trigger an expensive full pull/bootstrap. Add a ctx.Err() != nil guard before the retryable check to return early when the cycle context is already dead, avoiding unnecessary resource consumption.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At internal/mountsync/syncer.go, line 2466:

<comment>If the caller's `ctx` is cancelled, the resulting `context.Canceled` error is not matched by `isCursorResolutionRetryable`, causing the code to fall through to the `else` branch and trigger an expensive full pull/bootstrap. Add a `ctx.Err() != nil` guard before the retryable check to return early when the cycle context is already dead, avoiding unnecessary resource consumption.</comment>

<file context>
@@ -2459,6 +2463,15 @@ func (s *Syncer) pullRemote(ctx context.Context, conflicted map[string]struct{})
 		if errors.As(err, &httpErr) && httpErr.StatusCode == http.StatusNotFound {
 			// No events feed on this backend — fall through to the
 			// full-pull bootstrap path. (Pre-fix behaviour.)
+		} else if isCursorResolutionRetryable(err) {
+			// A completed prior bootstrap with tracked files is reusable
+			// state. Under load, falling through from a transient cursor
</file context>
Suggested change
} else if isCursorResolutionRetryable(err) {
} else if ctx.Err() != nil || isCursorResolutionRetryable(err) {

@khaliqgant khaliqgant merged commit 048fdbc into main May 30, 2026
9 checks passed
@khaliqgant khaliqgant deleted the fix/daemon-cursor-fast-path branch May 30, 2026 07:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant