# Sentry wrapper-bypass log-level fix

**Date:** 2026-05-01
**Status:** Design — not committed; lives in `_plans/` for handoff to implementation.
**Related:** [`dac9c83bd`](https://github.com/triggerdotdev/trigger.dev/commit/dac9c83bd) (chore: downgrade boundary log noise to warn)

## Problem

`dac9c83bd` added a Sentry SDK-level filter — `ignoreErrors: /^ServiceValidationError(?::|$)/` in `apps/webapp/sentry.server.ts` — that drops `ServiceValidationError` (SVE) events before they reach Sentry. The filter only matches when the captured event's *type* is `ServiceValidationError`; it relies on the SVE propagating as an exception.
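
For context, a minimal sketch of that SDK-level filter's shape (the import path and surrounding options are assumptions; only the `ignoreErrors` pattern is taken from the commit):

```ts
import * as Sentry from "@sentry/remix"; // import path assumed

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  // Drops events whose captured exception renders as "ServiceValidationError: ...".
  // Wrapper-message logger.error events never match, because the SVE is not the
  // captured exception there.
  ignoreErrors: [/^ServiceValidationError(?::|$)/],
});
```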

Nine call sites in the webapp catch `ServiceValidationError` (and analogous user-input error types like `OutOfEntitlementError`, `CreateDeclarativeScheduleError`, `QueryError`), then call `logger.error("wrapper message", { error: e })` *before* performing the type discrimination. That `logger.error` becomes a Sentry event titled with the wrapper message, with the inner error buried in `extra.error` — invisible to the SDK filter.

Visible impact in production today:

- [`TRIGGER-CLOUD-2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) "Failed to create background worker" — 13,013 lifetime occurrences, still firing in `19c16759f`.
- [`TRIGGER-CLOUD-38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) "Batch trigger error" — 755 events in last 48h, still firing.
- [`TRIGGER-CLOUD-103`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-103) "Stream batch items error" — 1,052 events in last 48h, still firing.

In each case the underlying cause is a `ServiceValidationError` that should never have escalated.

## Goals

- Stop the wrapper-bypass groups (`2S`, `38`, `103` and the smaller related issues) from escalating expected user-input failures to Sentry.
- Preserve `error`-level visibility for genuinely unknown errors at the same call sites.
- Match the style and voice of `dac9c83bd` so the change reads as a continuation of that PR's work.

## Non-goals

- The "wrap unrelated errors into `ServiceValidationError`" pattern in the two service files. We address its log-level symptom here (log inner at `error` before wrapping) but do not redesign the wrapping itself. Captured as a separate brainstorm.
- The single-task trigger endpoint's silent 500s (`api.v1.tasks.$taskId.trigger.ts`). Inverse problem — real bugs are being swallowed. Out of scope; separate brainstorm.
- Extending the Sentry SDK `ignoreErrors` filter. The downgrade to `logger.warn` is sufficient — `warn` does not reach Sentry regardless of the inner error type.
- Refactoring catch blocks beyond what's strictly needed for the log-level change. Side cleanups limited to converting `JSON.stringify(e)` to `error: e` where it appears (matches `dac9c83bd`'s `error: e` style and the v2 file's existing form).

## Affected call sites

Nine sites total. Confirmed by audit (broader grep covering all `XxxError` types, plus directory sweep across `apps/webapp/app`, `apps/supervisor`, and `internal-packages`).

| # | File:Line | Wrapper message | Expected types caught | Unknown-error fallback | Maps to Sentry issue |
|---|---|---|---|---|---|
| 1 | `apps/webapp/app/routes/api.v1.projects.$projectRef.background-workers.ts:60` | "Failed to create background worker" | `ServiceValidationError`, `CreateDeclarativeScheduleError` (both → 400) | 500 | [`2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) |
| 2 | `apps/webapp/app/routes/api.v1.deployments.$deploymentId.background-workers.ts:62` | "Failed to create background worker" | same | 500 | [`2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) |
| 3 | `apps/webapp/app/routes/api.v1.tasks.batch.ts:130` | "Batch trigger error" | `ServiceValidationError`, `OutOfEntitlementError` (both → 422) | 500 with `x-should-retry: false` | [`38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) |
| 4 | `apps/webapp/app/routes/api.v2.tasks.batch.ts:147` | "Batch trigger error" | same | 500 with `x-should-retry: false` | [`38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) |
| 5 | `apps/webapp/app/routes/api.v3.batches.ts:175` | "Create batch error" | `ServiceValidationError`, `OutOfEntitlementError` (both → 422) | 500 with `x-should-retry: false` | not currently in Sentry |
| 6 | `apps/webapp/app/v3/services/createBackgroundWorker.server.ts:149` | "Error syncing declarative schedules" | `ServiceValidationError` (rethrown) | wraps unknown into new `ServiceValidationError`, throws | not currently in Sentry |
| 7 | `apps/webapp/app/v3/services/createDeploymentBackgroundWorkerV4.server.ts:142` | "Error syncing declarative schedules" | same | wraps unknown into new `ServiceValidationError`, throws | not currently in Sentry |
| 8 | `apps/webapp/app/routes/api.v3.batches.$batchId.items.ts:91` | "Stream batch items error" | `ServiceValidationError` → 422; `Error` with `Invalid JSON` → 400 | 500 | [`103`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-103) |
| 9 | `apps/webapp/app/routes/api.v1.query.ts:69` | "Query API error" | `QueryError` → 400 | 500 | not currently in Sentry |

## Approach

Inline at each site; no shared helper. Three variants distinguished by call-site shape.

### Variant 1 — route catches that return (sites 1, 2, 3, 4, 5, 8)

Log the expected types at `warn` inside their branches and move `logger.error` after the `instanceof` chain, so only the unknown-error fall-through escalates. Convert `else if` chains to early returns for readability. Add a one-line comment in the `dac9c83bd` voice on the catch block.

Before (site 1):

```ts
} catch (e) {
  logger.error("Failed to create background worker", { error: JSON.stringify(e) });

  if (e instanceof ServiceValidationError) {
    return json({ error: e.message }, { status: 400 });
  } else if (e instanceof CreateDeclarativeScheduleError) {
    return json({ error: e.message }, { status: 400 });
  }

  return json({ error: "Failed to create background worker" }, { status: 500 });
}
```

After:

```ts
} catch (e) {
  // Customer-facing validation failures (invalid task config, customer cron
  // expression, etc.). The handler returns 4xx with the message; system
  // handles it gracefully, no alert needed.
  if (e instanceof ServiceValidationError) {
    logger.warn("Failed to create background worker", { error: e.message });
    return json({ error: e.message }, { status: 400 });
  }
  if (e instanceof CreateDeclarativeScheduleError) {
    logger.warn("Failed to create background worker", { error: e.message });
    return json({ error: e.message }, { status: 400 });
  }

  logger.error("Failed to create background worker", { error: e });

  return json({ error: "Failed to create background worker" }, { status: 500 });
}
```

Per-site differences:

- Sites 3, 4, 5: type list is `[ServiceValidationError, OutOfEntitlementError]`, status `422`, response includes the `x-should-retry: false` header on the unknown path (a sketch follows this list).
- Site 8 keeps the `Invalid JSON` substring branch (returns 400), but treats it as a second expected-type branch: it is a parse failure on customer input, so it logs at `warn` ahead of the unknown-error `logger.error`. Concrete refactor:

  ```ts
  } catch (error) {
    if (error instanceof ServiceValidationError) {
      logger.warn("Stream batch items error", { batchId, error: error.message });
      return json({ error: error.message }, { status: 422 });
    }

    if (error instanceof Error && error.message.includes("Invalid JSON")) {
      // Customer-supplied stream isn't valid JSON; surface as 400.
      logger.warn("Stream batch items error: invalid JSON", { batchId, error: error.message });
      return json({ error: error.message }, { status: 400 });
    }

    logger.error("Stream batch items error", {
      batchId,
      error: { message: (error as Error).message, stack: (error as Error).stack },
    });

    return json({ error: (error as Error).message ?? "Something went wrong" }, { status: 500 });
  }
  ```
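
For sites 3, 4, and 5, the reshaped catch looks like the sketch below. Everything beyond what the table states (message wording, exact response body, whether both expected types expose `.message`) is an assumption, not lifted from the routes:

```ts
} catch (error) {
  // Expected customer-facing failures: validation or entitlement. Both types are
  // assumed to extend Error and carry a customer-readable message.
  if (error instanceof ServiceValidationError || error instanceof OutOfEntitlementError) {
    logger.warn("Batch trigger error", { error: error.message });
    return json({ error: error.message }, { status: 422 });
  }

  logger.error("Batch trigger error", { error });

  // Unknown failure: 500, and tell the SDK not to retry the batch as-is.
  return json(
    { error: "Something went wrong" },
    { status: 500, headers: { "x-should-retry": "false" } }
  );
}
```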

### Variant 2 — services that wrap into SVE (sites 6, 7)

Split the SVE branch out before the `logger.error`. Keep `logger.error` for the wrap-real-bug path so visibility survives the SDK filter — same reasoning as `dac9c83bd`'s `waitpointCompletionPacket.server.ts` change.

Before (site 6, schedules branch):

```ts
if (schedulesError) {
  logger.error("Error syncing declarative schedules", {
    error: schedulesError,
    backgroundWorker,
    environment,
  });

  if (schedulesError instanceof ServiceValidationError) {
    throw schedulesError;
  }

  throw new ServiceValidationError("Error syncing declarative schedules");
}
```

After:

```ts
if (schedulesError) {
  if (schedulesError instanceof ServiceValidationError) {
    // Customer schedule config (typically invalid cron). Surface to client
    // via the rethrow; system returns gracefully.
    logger.warn("Error syncing declarative schedules", {
      error: schedulesError.message,
      backgroundWorker,
      environment,
    });
    throw schedulesError;
  }

  // Wrapping the underlying error into a ServiceValidationError below would
  // otherwise hide it once the SDK-level filter drops SVEs; log at error so
  // the underlying cause stays visible. Mirrors the
  // waitpointCompletionPacket.server.ts pattern from dac9c83bd.
  logger.error("Error syncing declarative schedules", {
    error: schedulesError,
    backgroundWorker,
    environment,
  });

  throw new ServiceValidationError("Error syncing declarative schedules");
}
```

Site 7 has the identical shape. Both files also have earlier `filesError` and `resourcesError` catches that wrap into `ServiceValidationError` unconditionally (no `instanceof` check). Those are already correct under this design — they always log the inner error at `error` before wrapping a real bug into an SVE — so no change is needed there.
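
For reference, the existing unconditional-wrap shape in those earlier catches looks roughly like the sketch below (variable names and message text are assumed, not lifted from the files); the inner error is already logged at `error` before the wrap, which is exactly what this design requires:

```ts
if (filesError) {
  // Real bug: log the inner error at error *before* wrapping, so the cause stays
  // visible even though the SDK filter drops the wrapped ServiceValidationError.
  logger.error("Error creating background worker files", {
    error: filesError,
    environment,
  });

  throw new ServiceValidationError("Error creating background worker files");
}
```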

### Variant 3 — site 9, different error type, same fix

`api.v1.query.ts` catches `QueryError` (a customer SQL error). The error type isn't covered by the SDK filter, but that doesn't matter — `logger.warn` doesn't reach Sentry regardless. Same downgrade pattern.

Before:

```ts
if (!queryResult.success) {
  const message =
    queryResult.error instanceof QueryError
      ? queryResult.error.message
      : "An unexpected error occurred while executing the query.";

  logger.error("Query API error", { error: queryResult.error, query });

  return json(
    { error: message },
    { status: queryResult.error instanceof QueryError ? 400 : 500 }
  );
}
```

After:

```ts
if (!queryResult.success) {
  if (queryResult.error instanceof QueryError) {
    // Customer SQL is invalid or unsupported. Returned to caller as 400.
    logger.warn("Query API error", { error: queryResult.error.message, query });
    return json({ error: queryResult.error.message }, { status: 400 });
  }

  logger.error("Query API error", { error: queryResult.error, query });
  return json(
    { error: "An unexpected error occurred while executing the query." },
    { status: 500 }
  );
}
```

### Comment style

Three principles, taken from `dac9c83bd`:

1. **Lead with what the failure means in business terms** — *"Customer-facing validation failures …"*, not *"This is a ServiceValidationError"*.
2. **State why the system is fine** — *"Handler returns 4xx; system handles it gracefully, no alert needed."*
3. **Where wrapping happens, name the visibility risk explicitly** — see Variant 2's second comment, mirroring the `waitpointCompletionPacket.server.ts` callout from the original PR.

Comments are 1–3 lines, indicative voice, no headers.

## Tests

Vitest, flat under `apps/webapp/test/` matching the existing webapp convention (`vitest.config.ts` only picks up `test/**/*.test.ts`, so co-located tests would be silently skipped). Each file gets ~3 tests.
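
For reference, a minimal sketch of the include pattern that drives this; the shape of `apps/webapp/vitest.config.ts` is assumed, and only the glob is the point here:

```ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Rooted at test/, so co-located app/**/*.test.ts files are never picked up.
    include: ["test/**/*.test.ts"],
  },
});
```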

Test shape:

```ts
import { describe, it, expect, vi, beforeEach } from "vitest";

vi.mock("~/services/logger.server", () => ({
  logger: { warn: vi.fn(), error: vi.fn(), info: vi.fn(), debug: vi.fn() },
}));

// Mock the underlying service so we can throw whatever we want.

describe("api.v1.projects.$projectRef.background-workers error handling", () => {
  beforeEach(() => vi.clearAllMocks());

  it("logs ServiceValidationError at warn and returns 400", async () => {
    // Arrange the service to throw new ServiceValidationError("bad cron").
    // Invoke the action.
    // Assert: logger.warn called once with the wrapper message.
    // logger.error not called.
    // response.status === 400, body.error === "bad cron".
  });

  it("logs CreateDeclarativeScheduleError at warn and returns 400", async () => { /* ... */ });

  it("logs unknown errors at error and returns 500", async () => { /* ... */ });
});
```

What we explicitly do NOT test:

- The Sentry SDK integration. The premise *"`logger.warn` does not reach Sentry"* is verified once at the SDK level; tests assert on the logger contract only.
- The full happy path / business logic of each route. Out of scope; would balloon test scope without adding signal on this change.

Per-site test counts:

- Sites 1, 2 (route catches with two expected types): 3 tests each.
- Sites 3, 4, 5, 8 (route catches with two expected types or extra `Invalid JSON` branch): 3 tests each.
- Site 9 (route catch with one expected type): 2 tests.
- Sites 6, 7 (services with wrap-into-SVE on `schedulesError`): 2 tests each — SVE rethrow with `warn`, non-SVE wrap with `error` (sketched below).

Approximate total: ~24 tests across 9 files.
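
A sketch of the service-level pair for sites 6 and 7, in the same comment-stub style as the route test above (describe name and mocked module are assumptions):

```ts
import { describe, it, expect, vi, beforeEach } from "vitest";

vi.mock("~/services/logger.server", () => ({
  logger: { warn: vi.fn(), error: vi.fn(), info: vi.fn(), debug: vi.fn() },
}));

// Mock the declarative-schedule sync so it can throw on demand.

describe("createBackgroundWorker schedule sync error handling", () => {
  beforeEach(() => vi.clearAllMocks());

  it("rethrows ServiceValidationError and logs at warn", async () => {
    // Arrange the sync to throw new ServiceValidationError("invalid cron").
    // Assert: the same error propagates, logger.warn called once, logger.error not called.
  });

  it("wraps unknown errors into ServiceValidationError and logs at error", async () => {
    // Arrange the sync to throw new Error("db down").
    // Assert: thrown error is a ServiceValidationError with the wrapper message,
    // logger.error called once with the inner error, logger.warn not called.
  });
});
```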

Authentication scaffolding for action invocations can be a thin shared helper inside a `__test_helpers__` module if duplication emerges; the implementation plan decides this based on what the first 1–2 files need. Not pinned by the spec.

## Rollout

Single PR. Test plan in the PR description mirrors `dac9c83bd`'s structure:

- [ ] `pnpm run typecheck --filter webapp`
- [ ] `pnpm run test --filter webapp` covering the new `*.test.ts` files
- [ ] Manual smoke on `references/hello-world`: trigger a task with a deliberately invalid cron schedule, verify 4xx response and `warn` log; no Sentry event. Repeat with valid config to confirm happy path is unaffected.
- [ ] Add `.server-changes/sentry-wrapper-bypass-fix.md` per the project rule (`server-apps.md`). High-level summary, no design detail.

## Risk

The change is purely log-level + ordering; behaviour-preserving for users (same status codes, same response bodies). Failure modes:

- **Over-quieting** — a previously `error`-logged real bug accidentally demoted to `warn`. Mitigated by tests on the unknown-error branch at every site.
- **Under-quieting** — a site missed by the audit continues to escalate. Mitigated by the post-deploy verification metric below.

## Verification

Defined success metric, included in the PR description:

48 hours after merge, Sentry event volume on the wrapper-bypass groups should drop to near-zero in the deployed release. Specifically:

- [`TRIGGER-CLOUD-2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) — was 13k lifetime, ~158 in last 48h.
- [`TRIGGER-CLOUD-38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) — 755 in last 48h.
- [`TRIGGER-CLOUD-103`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-103) — 1,052 in last 48h.

Any of those three still firing in the new release indicates a missed call site or a different error type slipping through; investigate and patch as a follow-up.

## Known follow-ups

Captured for separate brainstorms:

1. **Wrap-into-SVE pattern more broadly.** Services that wrap unrelated errors into `ServiceValidationError` create a permanent visibility risk: even with this fix, real bugs wrapped this way only stay visible because we explicitly log at error before the wrap. Replacing the wrap with a distinct `InternalServiceError` (or simply not wrapping) would let the SDK filter and route-level handlers treat the two cases correctly without per-site care.
2. **`api.v1.tasks.$taskId.trigger.ts` silent 500s.** The single-task trigger endpoint catches and returns 500 with `error.message` but never calls `logger.error`. Real bugs on the highest-volume endpoint in the system are not surfacing in Sentry. Inverse problem to this design.

## References

- `dac9c83bd` — chore(webapp,run-engine): downgrade boundary log noise to warn (#3462). Source for style, voice, and the `waitpointCompletionPacket.server.ts` "log inner before wrap" pattern reused here.
- `apps/webapp/sentry.server.ts` — current Sentry SDK config including `ignoreErrors` filter.
- Investigation Notion doc — *Sentry Triage May 2026 / Investigation: Prisma P1001 RDS Connectivity* (parent of this design's reasoning thread).