test: diagnose flaky boxel-cli prefer-local conflict-resolution test#4929
test: diagnose flaky boxel-cli prefer-local conflict-resolution test#4929habdelra wants to merge 2 commits into
Conversation
When the prefer-local conflict-resolution assertion fails, the existing test reported `expected '…remote…' to contain 'v = "local"'` with no context — no way to tell whether the sync push didn't fire, whether the local file was stale going in, or whether the push landed and the subsequent GET read stale bytes. This change: - Snapshots local + remote content before sync so a future failure proves whether `writeLocalFile` / `writeRemoteFile` actually landed. - Wraps the post-sync remote read in `fetchRemoteFileEventually`, which retries up to 2s. If the assertion passes with `pollMs > 0`, the failure mode is a brief post-write visibility race, not a sync-logic bug. If it times out, the diagnostic dump confirms the push didn't reach disk. - Emits a single-line diagnostic via `console.error` with every relevant content snapshot + retry counters on miss, so the next CI failure leaves enough signal to attribute root cause without another reproduction run. Same instrumentation for the prefer-remote variant since it shares the race-prone shape (concurrent local+remote mutation immediately followed by sync), even though only the prefer-local side has flaked so far. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR instruments the boxel-cli realm sync integration tests to better diagnose a flaky conflict-resolution case (--prefer-local) by capturing pre/post-sync snapshots and adding a bounded retry for the immediate post-sync remote read.
Changes:
- Added
fetchRemoteFileEventually()helper to poll the remote file for up to 2s for expected content after sync. - Captured pre-sync local/remote snapshots for the conflict tests to distinguish upstream write failures from sync/conflict-resolution issues.
- Added targeted
console.errordiagnostics on assertion misses for both--prefer-localand--prefer-remoteconflict tests.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Read a remote source file, retrying for up to `timeoutMs` if the body | ||
| // doesn't yet contain `expectedSubstring`. Returns the final body either way | ||
| // — callers assert on it. Built for the conflict-resolution tests where the | ||
| // sync push has already returned 201 but the assertion sometimes reads | ||
| // stale bytes; if a retry resolves it, the eventual `pollMs` value points | ||
| // at a post-write visibility race rather than the sync logic itself. |
There was a problem hiding this comment.
[Claude Code 🤖] Good catch — fixed in e1c4457. The helper now returns retries = attempts - 1 instead, and the diagnostic dump on miss uses retries=... as the unambiguous "needed-to-wait" signal. elapsedMs is still reported but for context only, not as the retry indicator.
| // Poll-retry to distinguish "sync didn't push" from "push landed but a | ||
| // brief visibility race made the GET read stale bytes". pollMs > 0 in | ||
| // the diagnostic dump points at the latter. |
There was a problem hiding this comment.
[Claude Code 🤖] Good catch — fixed in e1c4457. The helper now returns retries = attempts - 1 instead, and the diagnostic dump on miss uses retries=... as the unambiguous "needed-to-wait" signal. elapsedMs is still reported but for context only, not as the retry indicator.
`pollMs` was misleading — it measured fetch + sleep wall time even on first-attempt success, so the original comment claiming "pollMs > 0 ⇒ visibility race" was wrong. The unambiguous signal is whether any retry fired at all, so the helper now exposes `retries = attempts - 1` and the miss-path dump uses that. `elapsedMs` is still reported as context-only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
The
realm sync (integration) > resolves conflict with --prefer-local: local version winstest failed in run 26250117942 withexpected 'export const v = "remote";\n' to contain 'v = "local"'— i.e. after a sync that should have pushed local content over a remote conflict, the remote still held the older value.Both sides of the conflict use second-precision mtime detection, and the failing window between sync's
_atomicPOST returning 201 (20:25:00.733) and the subsequent GET reading the remote bytes (20:25:00.782) is only ~50ms. Local code reading (advisory write lock, source cache invalidation, sync push order) doesn't show a reproducible path to a stale read, so this change instruments the failure rather than guessing at a fix.What this change does
writeLocalFile/writeRemoteFileactually landed.fetchRemoteFileEventuallyhelper that retries for up to 2s. If the assertion passes withpollMs > 0, the failure mode is a brief post-write visibility race; if it times out, the dump confirms the push didn't reach disk.console.errordiagnostic on miss with all relevant snapshots +pollMs/attempts/realmUrl, so the next CI failure has enough signal to attribute root cause without re-running.Same instrumentation applied to the
--prefer-remotevariant since it shares the same race-prone shape (concurrent local+remote mutation immediately followed by sync), even though only--prefer-localhas flaked so far.Test plan
🤖 Generated with Claude Code