Extract preview/sync GitHub Actions #4897
037a389 to 79657eb
Observability diff (vs staging)
diff --git a/tmp/remote-canon.Nq1dRP/dashboards/boxel-status/indexing.json b/tmp/committed-canon.XjD11i/dashboards/boxel-status/indexing.json
index a39cf75..25280b9 100644
--- a/tmp/remote-canon.Nq1dRP/dashboards/boxel-status/indexing.json
+++ b/tmp/committed-canon.XjD11i/dashboards/boxel-status/indexing.json
@@ -69,6 +69,10 @@
"uid": "cef5v5sl9k7i8f"
},
"description": "System-wide operator action: queue a full reindex across every realm. The button disables itself while a `full-reindex` orchestration job is already pending or running. Per-realm reindex moved to the Realms dashboard. Click POSTs with `Authorization: Bearer ${grafana_secret}` (substituted from SSM at apply time, CS-10929).",
+ "fieldConfig": {
+ "defaults": {},
+ "overrides": []
+ },
"gridPos": {
"h": 8,
"w": 24,
(Run: https://github.com/cardstack/boxel/actions/runs/26161560752)

Grafana preview
Preview deployed for 1 dashboard in the staging Grafana.
Dashboards:
Preview is torn down automatically when this PR is closed or merged. (Run: https://github.com/cardstack/boxel/actions/runs/26161560825)

Preview deployments

Host Test Results: 1 files ±0, 1 suites ±0, 1h 34m 40s ⏱️ +3m 50s. Results for commit 880fdff. ± Comparison against earlier commit 57d3fe8.

Realm Server Test Results: 1 files ±0, 1 suites ±0, 10m 23s ⏱️ -18s. Results for commit 880fdff. ± Comparison against earlier commit 57d3fe8.
d7095f0 to 434ac24
Node's fetch always reports `TypeError: fetch failed` as `error.message`; the actual transport reason (ECONNRESET, TLS handshake error, undici socket error, ENOTFOUND, GOAWAY, etc.) is stashed on `error.cause` and was being silently dropped by the publish/unpublish error paths. That left the action-demo workflow showing a bare "Error: fetch failed" with no way to distinguish a real network issue from, say, a self-signed cert problem against the published-realm subdomain.

Wrap the three swallowed sites:

- `publish.ts` `.action()` catch: log `err.cause` separately if present.
- `publish.ts` `waitForPublishedRealmReady`: capture cause into the `lastError` string so the readiness-timeout error reports the same thing the polling loop kept hitting.
- `unpublish.ts` `unpublishRealm`: embed cause into the `result.error` string the CLI surfaces.

This is the diagnostic the action-demo on #4897 needs to figure out why publish hangs at the initial POST despite the server-side mount completing successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
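As an illustration of the pattern this commit describes, the sketch below formats an error together with its `cause` so that a bare `fetch failed` is not the only thing logged. The helper name `describeError` and the output format are assumptions made for this example; they are not taken from `publish.ts` or `unpublish.ts`.

```typescript
// Hypothetical helper (not the code in publish.ts / unpublish.ts):
// render an error plus the transport reason stashed on `error.cause`.
export function describeError(err: unknown): string {
  if (!(err instanceof Error)) {
    // Non-Error throwables (strings, numbers, ...) are stringified as-is.
    return String(err);
  }
  const head = `${err.name}: ${err.message}`;
  const cause = (err as { cause?: unknown }).cause;
  if (cause === undefined) {
    return head;
  }
  if (cause instanceof Error) {
    // Node system errors usually carry a string `code` such as ECONNRESET.
    const code = (cause as { code?: unknown }).code;
    const suffix = typeof code === "string" ? ` [${code}]` : "";
    return `${head} (cause: ${cause.name}: ${cause.message}${suffix})`;
  }
  return `${head} (cause: ${String(cause)})`;
}

// Usage sketch: wrap a fetch call and surface the cause in the log line.
export async function fetchWithDiagnostics(url: string): Promise<Response> {
  try {
    return await fetch(url);
  } catch (err) {
    console.error(`request to ${url} failed: ${describeError(err)}`);
    throw err;
  }
}
```

With a helper like this, a connection reset would be reported as something like `TypeError: fetch failed (cause: Error: read ECONNRESET [ECONNRESET])` instead of the bare message.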
The worker's `fatalExit` handler already exists (uncaughtException / unhandledRejection backstop with a finalize-reservation race), but it reports the error via `log.error(...)` immediately before `process.exit(1)`. `worker-manager.ts` spawns the child with `stdio: ['pipe', 'pipe', 'pipe', 'ipc']`, so the child's stderr is a libuv-async pipe; the final stream chunk gets discarded when the process disappears, and the captured server log shows the child as having silently exited `code=1, signal=null` with no clue why.

worker.ts already uses `writeSync(2, ...)` for exactly this reason on the STARTUP / SIGINT / SIGTERM / disconnect stamps (see the comment above the STARTUP block at the top of the file). Apply the same pattern to the three fatal-exit paths: the uncaughtException / unhandledRejection handler, its inner finalize-failed fallback, and the outer startup-error `.catch`. Route each through a new helper that serializes the error with its full stack and walks `error.cause` (where Node fetch / undici / TLS errors stash the real reason).

Discovered while debugging the action-demo on #4897 (CS-11180): every `_publish-realm` of a fresh source realm enqueues a copy-index job that throws *something* inside the worker; the worker exited silently; pg-queue retried, hit the 2-reservation cap, abandoned the job; the realm-server returned HTTP 500 `Job abandoned after 2 failed attempts (max=2)` to the publish endpoint caller. Without this fix the underlying job-processing error is unobservable.

The bundled `serialize-fatal-reason` helper is in its own module because the FD-level write behavior can't be unit-tested in-process (it requires a real child_process.spawn + libuv-piped stderr to reproduce the bug being fixed), but the serialization can. Tests cover: stack preservation, cause-chain walking, non-Error values, self-referential cause cycles (depth-capped), and Node fetch's typical `TypeError: fetch failed` + ECONNRESET-on-cause shape.

Closes CS-11200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
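A rough sketch of the approach in the commit message above: serialize the fatal reason with its stack, walk `cause` up to a depth cap, and write the result synchronously to file descriptor 2 before exiting. The function names, the `Caused by:` prefix, and the default cap of 5 are assumptions for illustration; the actual `serialize-fatal-reason` module may differ.

```typescript
import { writeSync } from "node:fs";

// Hypothetical serializer: full stack for each Error, then each `cause`
// in turn. The depth cap keeps a self-referential cause from looping forever.
export function serializeFatalReason(reason: unknown, maxDepth = 5): string {
  const lines: string[] = [];
  let current: unknown = reason;
  for (let depth = 0; depth <= maxDepth; depth++) {
    const prefix = depth === 0 ? "" : "Caused by: ";
    if (!(current instanceof Error)) {
      // Non-Error values (strings, numbers, null, ...) end the chain.
      lines.push(prefix + String(current));
      break;
    }
    lines.push(prefix + (current.stack ?? `${current.name}: ${current.message}`));
    const cause = (current as { cause?: unknown }).cause;
    if (cause === undefined) {
      break;
    }
    current = cause;
  }
  return lines.join("\n");
}

// Hypothetical reporter: a synchronous write to fd 2 completes before
// process.exit(1), unlike an asynchronous write to a piped stderr stream.
export function reportFatalAndExit(reason: unknown): never {
  try {
    writeSync(2, `[fatal] ${serializeFatalReason(reason)}\n`);
  } catch {
    // Nothing useful can be done if stderr itself is unwritable.
  }
  process.exit(1);
}
```

Keeping the serializer free of I/O is what makes it unit-testable in-process; only the thin `writeSync` wrapper needs a real child process to exercise.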
f8a1399 to 608717a
Extract the publish-preview-realm / unpublish-preview-realm / workspace-sync composite actions so `boxel-catalog`, `boxel-home`, `boxel-skills` (and any future consumer) can stop maintaining duplicated bespoke preview-realm workflows.

This branch is layered on top of cs-11161 (#4851) so the bundled demo workflow can exercise `boxel realm publish` / `unpublish` / `push` end-to-end against the CLI commits in this branch's ancestry. Once #4851 lands, GitHub will automatically retarget this PR's base to main and the diff will stay clean against main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Used while iterating on the three composite actions; not part of the shipped product. External consumers (boxel-catalog, boxel-home, boxel-skills) exercise the actions in their own preview workflows.
608717a to cb1d9db
Adds preview-realm-actions-integration.yml, which runs the three composite actions (publish, workspace-sync, unpublish) end-to-end against the same local matrix + realm-server stack `boxel-cli-test` boots, so contract drift between the actions, the boxel-cli commands they wrap, and the realm-server handlers they POST to is caught the moment any side changes.

Path-gated triggers (on `pull_request` and `push` to main, plus `workflow_dispatch` for manual) so only PRs touching the integration surface pay the runtime cost. The set covers each action.yml, this workflow, the publish/unpublish/push CLI commands, the handle-publish-realm / handle-unpublish-realm server handlers, and the copy-index task that the publish handler enqueues.

Uses path-relative `uses: ./.github/actions/...` so the actions run at the PR's own commit. External consumers (boxel-catalog, -home, -skills) pin a SHA instead.

Also re-applies the in-repo `mise` short-circuit in each action: when `github.action_repository == github.repository` (i.e., invoked from inside cardstack/boxel itself), set BOXEL_SRC to $GITHUB_WORKSPACE and skip the clone + mise/pnpm install steps because the calling workflow's ./.github/actions/init already did them. Without this the inner `jdx/mise-action` re-hashes a separate cache key whose lookup sits ~30 minutes before transfer. External consumers continue to go through the full clone + install path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The in-repo short-circuit compared `github.action_repository` against `github.repository`, but `github.action_repository` is only populated for *external* `uses: org/repo/...@ref` references. For path-relative `uses: ./.github/...` (which is exactly how preview-realm-actions-integration.yml invokes these actions), the value is empty, so the predicate `"" = "cardstack/boxel"` was false and the action fell into the external-consumer branch and tried to `git clone https://github.com/.git/`, failing with `remote: Not Found`. Treat empty BOXEL_REPO as in-repo too. External consumers still hit the populated-and-different branch and run the full clone + install.
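The corrected predicate can be expressed as a small function. This is only a TypeScript illustration of the logic described in the commit message above; the real check lives in each action's own steps, and the name `isInRepoInvocation` is hypothetical.

```typescript
// Hypothetical restatement of the short-circuit predicate.
// Per the commit message above, `github.action_repository` is empty for
// path-relative `uses: ./.github/...`, so an empty value counts as in-repo.
export function isInRepoInvocation(
  actionRepository: string | undefined,
  repository: string,
): boolean {
  const actionRepo = (actionRepository ?? "").trim();
  return actionRepo === "" || actionRepo === repository;
}

// In-repo, path-relative invocation: value is empty, so skip clone + install.
console.log(isInRepoInvocation("", "cardstack/boxel"));
// External consumer: populated and different, so run the full clone + install.
console.log(isInRepoInvocation("cardstack/boxel", "cardstack/boxel-home"));
```

The earlier version corresponds to dropping the `actionRepo === ""` arm, which is why the path-relative case fell through to the external-consumer branch.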
I noticed that `boxel-home` PR previews are broken. This is because the interface to `_publish-realm` changed: HTTP 202 is actually expected now!

I also noticed that `boxel-catalog`, `boxel-home`, and `boxel-skills` were all using duplicative bespoke workflows to accomplish similar tasks, with use of `cardstack/boxel-cli`, npm Boxel CLI, and the old workspace sync CLI.

This extracts the preview/sync workflows into the monorepo so they can be used from external repositories and tested in-monorepo in case of interface changes like the above.