fix(mount): retry transient cursor resolution instead of full export (#1499) by khaliqgant · Pull Request #221 · AgentWorkforce/relayfile

khaliqgant · 2026-05-30T07:10:22Z

Problem (#1499 relayfile warm-degrade stall loop)

On a reused/churned sticky box the restart fast-path resolved the events cursor under a single 20s deadline (resolveLatestEventCursor, internal/mountsync/syncer.go). Under cloud /fs/export API load that resolution hit context deadline exceeded, so the daemon fell through to a full-tree bootstrap export. The full export then:

exceeded the 90s bootstrap watchdog, and
exceeded the cloud per-fire flushTimeoutSeconds:20 → exit 124,

leaving the state dir non-empty without a completed bootstrap. The next box reuse repeated the same expensive path — the sticky bad-state loop. It only escaped when a clear API window happened to let one full pull complete.

Root cause (file:line)

internal/mountsync/syncer.go resolveLatestEventCursor — one attempt, one fixed cursorTimeout (20s) deadline; any timeout returned an error.
internal/mountsync/syncer.go restart fast-path in pullRemote (cursor-resolution-failed branch) — on any error it fell through to the full-pull/export bootstrap, including a transient timeout against already-reusable state (BootstrapComplete + tracked files).

Fix

Make the cursor fast-path tolerant of transient slow-API windows so box reuse re-syncs incrementally instead of full-exporting:

Retry with backoff. resolveLatestEventCursor now retries timeout-class failures (context.DeadlineExceeded / net.Error timeout) with exponential backoff — resolveLatestEventCursorOnce + isCursorResolutionRetryable + cursorResolutionRetryDelay. Each attempt gets its own rootCtx-derived deadline.
Raise the deadline 20s → 60s (defaultCursorTimeout + both --cursor-timeout flags in cmd/relayfile-mount and cmd/relayfile-cli).
Don't collapse reusable state into a full export. When cursor resolution still fails with a retryable timeout, the restart fast-path returns the cycle error instead of running a full export. The next reconcile retries the cheap cursor path; on-disk mirror state is preserved. This breaks the full-export loop.

Non-timeout / 404 (no events feed) errors keep the prior fall-through-to-full-pull behavior — this change is scoped to the transient timeout class.

Tests (`internal/mountsync/syncer_test.go`) — all GREEN

TestPullReusedMountWithPersistedCursorUsesIncrementalOnly — a reused mount with a persisted EventsCursor takes the incremental path only: no full tree pull, no cursor re-resolution.
TestPullRestartFastPathRetriesCursorDeadlineBeforeFullPull — a transient cursor deadline on restart retries and seeds the cursor incrementally — no full pull.
TestPullRestartFastPathTimeoutDoesNotFallBackToFullPull — a persistent cursor deadline exhausts retries and fails the cycle without a full pull.

Full go test ./internal/mountsync/ passes (~103s); go vet + gofmt clean.

Rollout

This is a change to the Go relayfile-mount daemon binary, distributed via GitHub releases (scripts/install.sh → releases/download/${RELAYFILE_VERSION:-latest}). After merge it needs a relayfile release/version bump (e.g. v0.8.4 → v0.8.5) so cloud-warmed sandboxes install the fixed daemon. Cloud must pin the new RELAYFILE_VERSION (or take latest) where it installs relayfile-mount.

🤖 Generated with Claude Code

…(#1499) On a reused/churned sticky box the restart fast-path resolved the events cursor under a single 20s deadline. Under cloud API load that resolution hit `context deadline exceeded`, and the daemon fell through to a full `/fs/export` bootstrap. The full export then exceeded the 90s bootstrap watchdog (and the cloud per-fire flushTimeoutSeconds:20 → exit 124), leaving the state dir non-empty without a completed bootstrap — so the next box reuse repeated the same expensive path. That is the #1499 warm-degrade stall loop. Fix the cursor fast-path to be resumable + tolerant of transient slow-API windows so box reuse re-syncs incrementally rather than full-exporting: - resolveLatestEventCursor now retries timeout-class failures (context.DeadlineExceeded / net.Error timeout) with exponential backoff (resolveLatestEventCursorOnce + isCursorResolutionRetryable + cursorResolutionRetryDelay), each attempt getting its own rootCtx-derived deadline. - Raise the default cursor-resolution deadline 20s → 60s (defaultCursorTimeout + both CLI/daemon --cursor-timeout flags). - When cursor resolution still fails with a retryable timeout, the restart fast-path returns the cycle error instead of collapsing reusable state (BootstrapComplete + tracked files) into a full export. The next reconcile retries the cheap cursor path, breaking the full-export loop; on-disk mirror state is preserved. Tests (internal/mountsync/syncer_test.go): - TestPullReusedMountWithPersistedCursorUsesIncrementalOnly: a reused mount with a persisted EventsCursor takes the incremental path only — no full tree pull, no cursor re-resolution. - TestPullRestartFastPathRetriesCursorDeadlineBeforeFullPull: a transient cursor deadline on restart retries and seeds the cursor incrementally — no full pull. - TestPullRestartFastPathTimeoutDoesNotFallBackToFullPull: a persistent cursor deadline exhausts retries and fails the cycle without a full pull. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-30T07:10:34Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 6b7f7235-0392-41b4-8130-fb38e9987405

📥 Commits

Reviewing files that changed from the base of the PR and between 7b3af75 and 67cc17f.

📒 Files selected for processing (4)

cmd/relayfile-cli/main.go
cmd/relayfile-mount/main.go
internal/mountsync/syncer.go
internal/mountsync/syncer_test.go

📝 Walkthrough

Walkthrough

The PR increases the --cursor-timeout CLI default from 20s to 60s, adds retry logic with exponential backoff to cursor resolution, and prevents restart fast-path from falling back to full bootstrap when cursor resolution fails with retryable errors. Changes span CLI configuration, core syncer logic, and new test coverage.

Changes

Cursor Timeout Increase and Retry Logic

Layer / File(s)	Summary
CLI flag defaults updated `cmd/relayfile-cli/main.go`, `cmd/relayfile-mount/main.go`	The `--cursor-timeout` flag default is increased from 20s to 60s in both CLI tools, and help text is updated to reflect the new default.
Syncer cursor-resolution configuration `internal/mountsync/syncer.go`	Default cursor timeout constant is increased to 60s, new retry attempt and backoff constants are introduced, and `CursorTimeout` field documentation is refreshed to document the new default and environment variable fallback behavior.
Cursor resolution retry wrapper and helpers `internal/mountsync/syncer.go`	`resolveLatestEventCursor` is refactored into a retry wrapper that calls a per-attempt implementation with deadline derived from root context, retrying on deadline/timeout errors with exponential backoff; `isCursorResolutionRetryable` and `cursorResolutionRetryDelay` helpers classify retryable failures and compute backoff delays.
Restart fast-path error handling `internal/mountsync/syncer.go`	`pullRemote` restart fast-path now explicitly returns retryable cursor-resolution errors instead of falling through to full bootstrap, allowing the daemon to retry cursor seeding on the next reconcile cycle.
Test infrastructure and cursor behavior coverage `internal/mountsync/syncer_test.go`	Test fake client gains `latestEventIDHook` callback to control cursor-resolution outcomes per attempt; three new tests verify that persisted cursors skip bootstrap, restart fast-path retries before fallback, and persistent timeouts prevent fallback to full pull.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

AgentWorkforce/relayfile#166: Modifies mountsync event-cursor resolution and CursorTimeout wiring used by restart fast-path, overlapping with this PR's cursor-timeout default and retry changes.
AgentWorkforce/relayfile#196: Changes resolveLatestEventCursor behavior in syncer.go to prefer LatestEventID with paginated fallback, directly related to this PR's retry/timeout updates.
AgentWorkforce/relayfile#92: Modifies restart/fast-path logic around resolving and seeding EventsCursor from events feed, including retry vs fallback decisions.

Poem

🐰 A whisker-twitching ode to timeouts and tries:
Twenty seconds was brief, but now we're wise—
Sixty seconds of patience, with retries that flow,
Exponential backoff makes restarts go slow.
Fast-path returns errors, no bootstrap fallback,
One simple tweak keeps the relay on track! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: fixing a daemon issue by implementing retry logic for transient cursor resolution failures instead of falling back to full exports.
Description check	✅ Passed	The description thoroughly explains the problem, root cause, fix, tests, and rollout requirements, all directly related to the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/daemon-cursor-fast-path

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-30T07:12:08Z

Relayfile Eval Review

Run: .relayfile/evals/runs/2026-05-30T07-12-07-547Z-HEAD-provider
Mode: provider
Git SHA: d3694f2

Human Review Cases

No reviewable human-review cases captured Relayfile output.

gemini-code-assist

Code Review

This pull request increases the default cursor timeout from 20 to 60 seconds and introduces a retry mechanism with exponential backoff for transient cursor resolution failures. It also prevents falling back to an expensive full pull when a cursor resolution fails due to a retryable error (such as a timeout). The review comments identify two important issues: first, if the caller's context is cancelled, the syncer could incorrectly fall back to a heavy full pull instead of aborting immediately; second, the retry loop does not immediately abort when the caller's context is cancelled, leading to unnecessary delays and log spam. Both comments provide actionable code suggestions to check ctx.Err() to resolve these issues.

gemini-code-assist · 2026-05-30T07:12:12Z

 		if errors.As(err, &httpErr) && httpErr.StatusCode == http.StatusNotFound {
 			// No events feed on this backend — fall through to the
 			// full-pull bootstrap path. (Pre-fix behaviour.)
+		} else if isCursorResolutionRetryable(err) {


If the caller's context ctx is cancelled (e.g., context.Canceled), isCursorResolutionRetryable(err) returns false. This causes the syncer to fall through to the else block and trigger a heavy full pull/bootstrap in the background (since bootstrapContext derives its context from s.rootCtx rather than ctx). Checking ctx.Err() != nil ensures we abort immediately if the cycle's context is no longer active, preventing unnecessary resource consumption and potential server overload.

Suggested change

} else if isCursorResolutionRetryable(err) {

} else if ctx.Err() != nil || isCursorResolutionRetryable(err) {

gemini-code-assist · 2026-05-30T07:12:12Z

+			return cursor, nil
+		}
+		lastErr = err
+		if !isCursorResolutionRetryable(err) || attempt == attempts-1 {


If the caller's context ctx is cancelled or times out during an attempt, resolveLatestEventCursorOnce will return context.Canceled or context.DeadlineExceeded. Since isCursorResolutionRetryable(context.DeadlineExceeded) is true, the retry loop will continue to wait for the retry delay and attempt the next iteration, even though any subsequent attempt is guaranteed to fail immediately. Checking ctx.Err() != nil allows us to abort the retry loop immediately, avoiding useless log spam and unnecessary sleep delays.

Suggested change

if !isCursorResolutionRetryable(err) || attempt == attempts-1 {

if ctx.Err() != nil || !isCursorResolutionRetryable(err) || attempt == attempts-1 {

pullfrog

✅ No new issues found.

Reviewed changes — Fixes a production stall loop where transient cursor-resolution timeout on reused sticky boxes collapsed into a full-tree export, which then exceeded the bootstrap watchdog and flush timeout, leaving state non-empty without completed bootstrap.

Retry cursor resolution with exponential backoff (internal/mountsync/syncer.go) — resolveLatestEventCursor now delegates to resolveLatestEventCursorOnce in a retry loop (3 attempts, 250ms base delay, 2× per attempt, capped at min(2s, timeout/10)). Each attempt gets its own rootCtx-derived deadline.
Classify timeout errors as retryable — isCursorResolutionRetryable identifies context.DeadlineExceeded and net.Error timeouts; non-retryable errors (404 missing events feed, etc.) preserve the pre-existing fall-through to full export.
Break the full-export loop — In pullRemote, when cursor resolution fails with a retryable error even after exhausting retries, the error is surfaced instead of falling through to full-pull bootstrap. On-disk mirror state is preserved; the next reconcile retries the cheap cursor path.
Raise cursor timeout 20s → 60s (cmd/relayfile-cli/main.go, cmd/relayfile-mount/main.go, internal/mountsync/syncer.go) — matches the internal default and both CLI flag defaults.
Tests (internal/mountsync/syncer_test.go) — Three new tests cover: (1) persisted cursor reuse skips full pull and cursor resolution entirely, (2) transient deadline triggers one retry then succeeds, (3) persistent deadline exhausts retries and returns the error without falling through to full pull.

^{｜ View workflow run ｜ 𝕏}

cubic-dev-ai

2 issues found across 4 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="internal/mountsync/syncer.go">

<violation number="1" location="internal/mountsync/syncer.go:2466">
P2: If the caller's `ctx` is cancelled, the resulting `context.Canceled` error is not matched by `isCursorResolutionRetryable`, causing the code to fall through to the `else` branch and trigger an expensive full pull/bootstrap. Add a `ctx.Err() != nil` guard before the retryable check to return early when the cycle context is already dead, avoiding unnecessary resource consumption.</violation>

<violation number="2" location="internal/mountsync/syncer.go:4040">
P2: If the caller's context `ctx` is cancelled during an attempt, the error from `resolveLatestEventCursorOnce` may still be `context.DeadlineExceeded` (from the `s.rootCtx`-derived timeout), which passes `isCursorResolutionRetryable`. The loop then sleeps via `waitWithContext(s.rootCtx, delay)` which doesn't observe `ctx`, causing unnecessary delays. Check `ctx.Err() != nil` to bail out immediately when the caller's context is done.</violation>
</file>

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

cubic-dev-ai · 2026-05-30T07:17:28Z

+			return cursor, nil
+		}
+		lastErr = err
+		if !isCursorResolutionRetryable(err) || attempt == attempts-1 {


P2: If the caller's context ctx is cancelled during an attempt, the error from resolveLatestEventCursorOnce may still be context.DeadlineExceeded (from the s.rootCtx-derived timeout), which passes isCursorResolutionRetryable. The loop then sleeps via waitWithContext(s.rootCtx, delay) which doesn't observe ctx, causing unnecessary delays. Check ctx.Err() != nil to bail out immediately when the caller's context is done.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At internal/mountsync/syncer.go, line 4040: <comment>If the caller's context `ctx` is cancelled during an attempt, the error from `resolveLatestEventCursorOnce` may still be `context.DeadlineExceeded` (from the `s.rootCtx`-derived timeout), which passes `isCursorResolutionRetryable`. The loop then sleeps via `waitWithContext(s.rootCtx, delay)` which doesn't observe `ctx`, causing unnecessary delays. Check `ctx.Err() != nil` to bail out immediately when the caller's context is done.</comment> <file context> @@ -4013,12 +4026,42 @@ func (s *Syncer) markIncrementalCheckpoint(pageStartCursor, pageCursor, phase, r + return cursor, nil + } + lastErr = err + if !isCursorResolutionRetryable(err) || attempt == attempts-1 { + return "", err + } </file context>

Suggested change

if !isCursorResolutionRetryable(err) || attempt == attempts-1 {

if ctx.Err() != nil || !isCursorResolutionRetryable(err) || attempt == attempts-1 {

cubic-dev-ai · 2026-05-30T07:17:28Z

 		if errors.As(err, &httpErr) && httpErr.StatusCode == http.StatusNotFound {
 			// No events feed on this backend — fall through to the
 			// full-pull bootstrap path. (Pre-fix behaviour.)
+		} else if isCursorResolutionRetryable(err) {


P2: If the caller's ctx is cancelled, the resulting context.Canceled error is not matched by isCursorResolutionRetryable, causing the code to fall through to the else branch and trigger an expensive full pull/bootstrap. Add a ctx.Err() != nil guard before the retryable check to return early when the cycle context is already dead, avoiding unnecessary resource consumption.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At internal/mountsync/syncer.go, line 2466: <comment>If the caller's `ctx` is cancelled, the resulting `context.Canceled` error is not matched by `isCursorResolutionRetryable`, causing the code to fall through to the `else` branch and trigger an expensive full pull/bootstrap. Add a `ctx.Err() != nil` guard before the retryable check to return early when the cycle context is already dead, avoiding unnecessary resource consumption.</comment> <file context> @@ -2459,6 +2463,15 @@ func (s *Syncer) pullRemote(ctx context.Context, conflicted map[string]struct{}) if errors.As(err, &httpErr) && httpErr.StatusCode == http.StatusNotFound { // No events feed on this backend — fall through to the // full-pull bootstrap path. (Pre-fix behaviour.) + } else if isCursorResolutionRetryable(err) { + // A completed prior bootstrap with tracked files is reusable + // state. Under load, falling through from a transient cursor </file context>

Suggested change

} else if isCursorResolutionRetryable(err) {

} else if ctx.Err() != nil || isCursorResolutionRetryable(err) {

gemini-code-assist Bot reviewed May 30, 2026

View reviewed changes

pullfrog Bot reviewed May 30, 2026

View reviewed changes

cubic-dev-ai Bot reviewed May 30, 2026

View reviewed changes

khaliqgant merged commit 048fdbc into main May 30, 2026
9 checks passed

khaliqgant deleted the fix/daemon-cursor-fast-path branch May 30, 2026 07:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mount): retry transient cursor resolution instead of full export (#1499)#221

fix(mount): retry transient cursor resolution instead of full export (#1499)#221
khaliqgant merged 1 commit into
mainfrom
fix/daemon-cursor-fast-path

khaliqgant commented May 30, 2026

Uh oh!

coderabbitai Bot commented May 30, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 30, 2026

Uh oh!

gemini-code-assist Bot May 30, 2026

Uh oh!

pullfrog Bot left a comment

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

cubic-dev-ai Bot May 30, 2026

Uh oh!

cubic-dev-ai Bot May 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	} else if isCursorResolutionRetryable(err) {
	} else if ctx.Err() != nil \|\| isCursorResolutionRetryable(err) {

	if !isCursorResolutionRetryable(err) \|\| attempt == attempts-1 {
	if ctx.Err() != nil \|\| !isCursorResolutionRetryable(err) \|\| attempt == attempts-1 {

Conversation

khaliqgant commented May 30, 2026

Problem (#1499 relayfile warm-degrade stall loop)

Root cause (file:line)

Fix

Tests (internal/mountsync/syncer_test.go) — all GREEN

Rollout

Uh oh!

coderabbitai Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented May 30, 2026

Relayfile Eval Review

Human Review Cases

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

pullfrog Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Tests (`internal/mountsync/syncer_test.go`) — all GREEN

coderabbitai Bot commented May 30, 2026 •

edited

Loading