Skip to content

worker: write fatal-error diagnostics syncly so stderr isn't dropped#4906

Draft
backspace wants to merge 3 commits into
mainfrom
cs-11200-worker-fatal-logging
Draft

worker: write fatal-error diagnostics syncly so stderr isn't dropped#4906
backspace wants to merge 3 commits into
mainfrom
cs-11200-worker-fatal-logging

Conversation

@backspace
Copy link
Copy Markdown
Contributor

@backspace backspace commented May 20, 2026

Another problem I encountered while working on #4897, explained by Claude:

The worker's fatal-error handler logged via log.error(...) immediately before process.exit(1). Because worker-manager spawns the worker with piped stderr and Node's stream-layer write is libuv-async, the final log lines were dropped before the parent's worker.stderr.on('data') reader saw them, so children appeared to die silently as code=1, signal=null. This swaps the three exit paths to use writeSync(2, …) — the FD-level syscall, same pattern the file's STARTUP/SIGINT stamps already use — routed through a serializeFatalReason() helper that walks error.cause. So crashes now produce [worker] FATAL …: Caused by: … in the captured log instead of vanishing.

For example, here:

Error: Publish failed: HTTP 500: {"errors":["Job abandoned after 2 failed attempts (max=2)"]}
Error: Process completed with exit code 1.

The worker's `fatalExit` handler already exists (uncaughtException /
unhandledRejection backstop with a finalize-reservation race) — but
it reports the error via `log.error(...)` immediately before
`process.exit(1)`. `worker-manager.ts` spawns the child with
`stdio: ['pipe', 'pipe', 'pipe', 'ipc']`, so the child's stderr is a
libuv-async pipe; the final stream chunk gets discarded when the
process disappears, and the captured server log shows the child as
having silently exited `code=1, signal=null` with no clue why.

worker.ts already uses `writeSync(2, ...)` for exactly this reason
on the STARTUP / SIGINT / SIGTERM / disconnect stamps (see the
comment above the STARTUP block at the top of the file). Apply the
same pattern to the three fatal-exit paths: the uncaughtException /
unhandledRejection handler, its inner finalize-failed fallback, and
the outer startup-error `.catch`. Route each through a new helper
that serializes the error with its full stack and walks `error.cause`
(where Node fetch / undici / TLS errors stash the real reason).

Discovered while debugging the action-demo on #4897 (CS-11180): every
`_publish-realm` of a fresh source realm enqueues a copy-index job
that throws *something* inside the worker; the worker exited
silently; pg-queue retried, hit the 2-reservation cap, abandoned the
job; the realm-server returned HTTP 500
`Job abandoned after 2 failed attempts (max=2)` to the publish
endpoint caller. Without this fix the underlying job-processing
error is unobservable.

The bundled `serialize-fatal-reason` helper is in its own module
because the FD-level write behavior can't be unit-tested in-process
(it requires a real child_process.spawn + libuv-piped stderr to
reproduce the bug being fixed) — but the serialization can. Tests
cover: stack preservation, cause-chain walking, non-Error values,
self-referential cause cycles (depth-capped), and Node fetch's
typical `TypeError: fetch failed` + ECONNRESET-on-cause shape.

Closes CS-11200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@backspace backspace marked this pull request as draft May 20, 2026 19:18
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ef1f358e6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/realm-server/lib/serialize-fatal-reason.ts Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

Host Test Results

    1 files  ±    0      1 suites  ±0   1h 34m 43s ⏱️ + 52m 17s
2 712 tests +1 252  2 697 ✅ +1 251  15 💤 +4  0 ❌  - 2 
2 731 runs  +1 259  2 716 ✅ +1 259  15 💤 +4  0 ❌  - 3 

Results for commit 6e89715. ± Comparison against earlier commit 82d4e66.

Realm Server Test Results

    1 files  ±  0      1 suites  ±0   11m 30s ⏱️ + 6m 1s
1 489 tests +614  1 489 ✅ +614  0 💤 ±0  0 ❌ ±0 
1 580 runs  +677  1 580 ✅ +677  0 💤 ±0  0 ❌ ±0 

Results for commit 6e89715. ± Comparison against earlier commit 82d4e66.

backspace and others added 2 commits May 21, 2026 14:40
Per code review on #4906 (P1): `String(Object.create(null))` throws
`TypeError: Cannot convert object to primitive value` because the
prototype-less object has neither `@@toPrimitive`, `toString`, nor
`valueOf` for `OrdinaryToPrimitive` to call. Any object whose own
`toString` throws hits the same path. `unhandledRejection` reasons
are `any` and libraries do occasionally reject with such values.

If `serializeFatalReason` had thrown inside `fatalExit` after
`isFatalHandlerRunning = true`, the throw would have become an
uncaughtException, re-entered `fatalExit`, hit the early-return
guard, and returned silently — stranding the worker with its
pg-queue reservation held and no FATAL stamp written. The fatal
path cannot tolerate that.

Two layers of defense:

1. `serialize-fatal-reason.ts`: wrap the entire body in try/catch
   and route every `String(...)` call through a `safeString` helper
   that catches and returns `[unstringifiable value]`.

2. `worker.ts`: wrap each `serializeFatalReason` call at its three
   call sites (uncaughtException/unhandledRejection handler, the
   finalize-failed fallback, and the outer startup-error catch) in
   a `safeSerialize` helper. The helper's own try/catch is the
   primary defense; the call-site wrap is belt-and-suspenders
   against a future regression in the helper.

Tests added for the three triggering shapes: a prototype-less
rejection value, a value whose own `toString` throws, and an Error
whose `cause` has a hostile `toString`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prettier collapsed two ternary/call-expression forms onto a single
line where the result fits the width budget. No semantic changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant