Skip to content

test: diagnose flaky boxel-cli prefer-local conflict-resolution test#4929

Open
habdelra wants to merge 2 commits into
mainfrom
worktree-fix-realm-sync-prefer-local-flake
Open

test: diagnose flaky boxel-cli prefer-local conflict-resolution test#4929
habdelra wants to merge 2 commits into
mainfrom
worktree-fix-realm-sync-prefer-local-flake

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

Summary

The realm sync (integration) > resolves conflict with --prefer-local: local version wins test failed in run 26250117942 with expected 'export const v = "remote";\n' to contain 'v = "local"' — i.e. after a sync that should have pushed local content over a remote conflict, the remote still held the older value.

Both sides of the conflict use second-precision mtime detection, and the failing window between sync's _atomic POST returning 201 (20:25:00.733) and the subsequent GET reading the remote bytes (20:25:00.782) is only ~50ms. Local code reading (advisory write lock, source cache invalidation, sync push order) doesn't show a reproducible path to a stale read, so this change instruments the failure rather than guessing at a fix.

What this change does

  • Snapshots local + remote content before sync, so a future failure proves whether writeLocalFile / writeRemoteFile actually landed.
  • Wraps the post-sync remote read in a fetchRemoteFileEventually helper that retries for up to 2s. If the assertion passes with pollMs > 0, the failure mode is a brief post-write visibility race; if it times out, the dump confirms the push didn't reach disk.
  • Emits a single-line console.error diagnostic on miss with all relevant snapshots + pollMs / attempts / realmUrl, so the next CI failure has enough signal to attribute root cause without re-running.

Same instrumentation applied to the --prefer-remote variant since it shares the same race-prone shape (concurrent local+remote mutation immediately followed by sync), even though only --prefer-local has flaked so far.

Test plan

  • Boxel CLI Tests pass on CI (no longer flakes on the conflict-resolution case under normal conditions)
  • If it flakes again, the failure log carries the diagnostic line and we can attribute the cause

🤖 Generated with Claude Code

When the prefer-local conflict-resolution assertion fails, the existing
test reported `expected '…remote…' to contain 'v = "local"'` with no
context — no way to tell whether the sync push didn't fire, whether the
local file was stale going in, or whether the push landed and the
subsequent GET read stale bytes.

This change:

- Snapshots local + remote content before sync so a future failure
  proves whether `writeLocalFile` / `writeRemoteFile` actually landed.
- Wraps the post-sync remote read in `fetchRemoteFileEventually`, which
  retries up to 2s. If the assertion passes with `pollMs > 0`, the
  failure mode is a brief post-write visibility race, not a sync-logic
  bug. If it times out, the diagnostic dump confirms the push didn't
  reach disk.
- Emits a single-line diagnostic via `console.error` with every
  relevant content snapshot + retry counters on miss, so the next CI
  failure leaves enough signal to attribute root cause without another
  reproduction run.

Same instrumentation for the prefer-remote variant since it shares the
race-prone shape (concurrent local+remote mutation immediately followed
by sync), even though only the prefer-local side has flaked so far.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR instruments the boxel-cli realm sync integration tests to better diagnose a flaky conflict-resolution case (--prefer-local) by capturing pre/post-sync snapshots and adding a bounded retry for the immediate post-sync remote read.

Changes:

  • Added fetchRemoteFileEventually() helper to poll the remote file for up to 2s for expected content after sync.
  • Captured pre-sync local/remote snapshots for the conflict tests to distinguish upstream write failures from sync/conflict-resolution issues.
  • Added targeted console.error diagnostics on assertion misses for both --prefer-local and --prefer-remote conflict tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +160 to +165
// Read a remote source file, retrying for up to `timeoutMs` if the body
// doesn't yet contain `expectedSubstring`. Returns the final body either way
// — callers assert on it. Built for the conflict-resolution tests where the
// sync push has already returned 201 but the assertion sometimes reads
// stale bytes; if a retry resolves it, the eventual `pollMs` value points
// at a post-write visibility race rather than the sync logic itself.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude Code 🤖] Good catch — fixed in e1c4457. The helper now returns retries = attempts - 1 instead, and the diagnostic dump on miss uses retries=... as the unambiguous "needed-to-wait" signal. elapsedMs is still reported but for context only, not as the retry indicator.

Comment on lines +295 to +297
// Poll-retry to distinguish "sync didn't push" from "push landed but a
// brief visibility race made the GET read stale bytes". pollMs > 0 in
// the diagnostic dump points at the latter.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude Code 🤖] Good catch — fixed in e1c4457. The helper now returns retries = attempts - 1 instead, and the diagnostic dump on miss uses retries=... as the unambiguous "needed-to-wait" signal. elapsedMs is still reported but for context only, not as the retry indicator.

@habdelra habdelra changed the title boxel-cli: diagnose flaky prefer-local conflict-resolution test test: diagnose flaky boxel-cli prefer-local conflict-resolution test May 21, 2026
`pollMs` was misleading — it measured fetch + sleep wall time even on
first-attempt success, so the original comment claiming "pollMs > 0 ⇒
visibility race" was wrong. The unambiguous signal is whether any retry
fired at all, so the helper now exposes `retries = attempts - 1` and the
miss-path dump uses that. `elapsedMs` is still reported as
context-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants