# Sentry wrapper-bypass log-level fix

**Date:** 2026-05-01
**Status:** Design — not committed; lives in `_plans/` for handoff to implementation.
**Related:** [`dac9c83bd`](https://github.com/triggerdotdev/trigger.dev/commit/dac9c83bd) (chore: downgrade boundary log noise to warn)

## Problem

`dac9c83bd` added a Sentry SDK-level filter — `ignoreErrors: /^ServiceValidationError(?::|$)/` in `apps/webapp/sentry.server.ts` — that drops `ServiceValidationError` (SVE) events before they reach Sentry. The filter only matches when the captured event's *type* is `ServiceValidationError`; it relies on the SVE propagating as an exception.
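
For context, a minimal sketch of that SDK-level filter's shape (the import path and surrounding options are assumptions; only the `ignoreErrors` pattern is taken from the commit):

```ts
import * as Sentry from "@sentry/remix"; // import path assumed

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  // Drops events whose captured exception renders as "ServiceValidationError: ...".
  // Wrapper-message logger.error events never match, because the SVE is not the
  // captured exception there.
  ignoreErrors: [/^ServiceValidationError(?::|$)/],
});
```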

Nine call sites in the webapp catch `ServiceValidationError` (and analogous user-input error types like `OutOfEntitlementError`, `CreateDeclarativeScheduleError`, `QueryError`), then call `logger.error("wrapper message", { error: e })` *before* performing the type discrimination. That `logger.error` becomes a Sentry event titled with the wrapper message, with the inner error buried in `extra.error` — invisible to the SDK filter.

Visible impact in production today:

- [`TRIGGER-CLOUD-2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) "Failed to create background worker" — 13,013 lifetime occurrences, still firing in `19c16759f`.
- [`TRIGGER-CLOUD-38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) "Batch trigger error" — 755 events in last 48h, still firing.
- [`TRIGGER-CLOUD-103`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-103) "Stream batch items error" — 1,052 events in last 48h, still firing.

In each case the underlying cause is a `ServiceValidationError` that should never have escalated.

## Goals

- Stop the wrapper-bypass groups (`2S`, `38`, `103` and the smaller related issues) from escalating expected user-input failures to Sentry.
- Preserve `error`-level visibility for genuinely unknown errors at the same call sites.
- Match the style and voice of `dac9c83bd` so the change reads as a continuation of that PR's work.

## Non-goals

- The "wrap unrelated errors into `ServiceValidationError`" pattern in the two service files. We address its log-level symptom here (log inner at `error` before wrapping) but do not redesign the wrapping itself. Captured as a separate brainstorm.
- The single-task trigger endpoint's silent 500s (`api.v1.tasks.$taskId.trigger.ts`). Inverse problem — real bugs are being swallowed. Out of scope; separate brainstorm.
- Extending the Sentry SDK `ignoreErrors` filter. The downgrade to `logger.warn` is sufficient — `warn` does not reach Sentry regardless of the inner error type.
- Refactoring catch blocks beyond what's strictly needed for the log-level change. Side cleanups limited to converting `JSON.stringify(e)` to `error: e` where it appears (matches `dac9c83bd`'s `error: e` style and the v2 file's existing form).

## Affected call sites

Nine sites total. Confirmed by audit (broader grep covering all `XxxError` types, plus directory sweep across `apps/webapp/app`, `apps/supervisor`, and `internal-packages`).

| # | File:Line | Wrapper message | Expected types caught | Unknown-error fallback | Maps to Sentry issue |
|---|---|---|---|---|---|
| 1 | `apps/webapp/app/routes/api.v1.projects.$projectRef.background-workers.ts:60` | "Failed to create background worker" | `ServiceValidationError`, `CreateDeclarativeScheduleError` (both → 400) | 500 | [`2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) |
| 2 | `apps/webapp/app/routes/api.v1.deployments.$deploymentId.background-workers.ts:62` | "Failed to create background worker" | same | 500 | [`2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) |
| 3 | `apps/webapp/app/routes/api.v1.tasks.batch.ts:130` | "Batch trigger error" | `ServiceValidationError`, `OutOfEntitlementError` (both → 422) | 500 with `x-should-retry: false` | [`38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) |
| 4 | `apps/webapp/app/routes/api.v2.tasks.batch.ts:147` | "Batch trigger error" | same | 500 with `x-should-retry: false` | [`38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) |
| 5 | `apps/webapp/app/routes/api.v3.batches.ts:175` | "Create batch error" | `ServiceValidationError`, `OutOfEntitlementError` (both → 422) | 500 with `x-should-retry: false` | not currently in Sentry |
| 6 | `apps/webapp/app/v3/services/createBackgroundWorker.server.ts:149` | "Error syncing declarative schedules" | `ServiceValidationError` (rethrown) | wraps unknown into new `ServiceValidationError`, throws | not currently in Sentry |
| 7 | `apps/webapp/app/v3/services/createDeploymentBackgroundWorkerV4.server.ts:142` | "Error syncing declarative schedules" | same | wraps unknown into new `ServiceValidationError`, throws | not currently in Sentry |
| 8 | `apps/webapp/app/routes/api.v3.batches.$batchId.items.ts:91` | "Stream batch items error" | `ServiceValidationError` → 422; `Error` with `Invalid JSON` → 400 | 500 | [`103`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-103) |
| 9 | `apps/webapp/app/routes/api.v1.query.ts:69` | "Query API error" | `QueryError` → 400 | 500 | not currently in Sentry |

## Approach

Inline at each site; no shared helper. Three variants distinguished by call-site shape.

### Variant 1 — route catches that return (sites 1, 2, 3, 4, 5, 8)

Log the expected types at `warn` inside their branches and move `logger.error` after the `instanceof` chain, so only the unknown-error fall-through escalates. Convert `else if` chains to early returns for readability. Add a one-line comment in the `dac9c83bd` voice on the catch block.

Before (site 1):

```ts
} catch (e) {
  logger.error("Failed to create background worker", { error: JSON.stringify(e) });

  if (e instanceof ServiceValidationError) {
    return json({ error: e.message }, { status: 400 });
  } else if (e instanceof CreateDeclarativeScheduleError) {
    return json({ error: e.message }, { status: 400 });
  }

  return json({ error: "Failed to create background worker" }, { status: 500 });
}
```

After:

```ts
} catch (e) {
  // Customer-facing validation failures (invalid task config, customer cron
  // expression, etc.). The handler returns 4xx with the message; system
  // handles it gracefully, no alert needed.
  if (e instanceof ServiceValidationError) {
    logger.warn("Failed to create background worker", { error: e.message });
    return json({ error: e.message }, { status: 400 });
  }
  if (e instanceof CreateDeclarativeScheduleError) {
    logger.warn("Failed to create background worker", { error: e.message });
    return json({ error: e.message }, { status: 400 });
  }

  logger.error("Failed to create background worker", { error: e });

  return json({ error: "Failed to create background worker" }, { status: 500 });
}
```

Per-site differences:

- Sites 3, 4, 5: type list is `[ServiceValidationError, OutOfEntitlementError]`, status `422`, response includes the `x-should-retry: false` header on the unknown path (a sketch follows this list).
- Site 8 keeps the `Invalid JSON` substring branch (returns 400), but treats it as a second expected-type branch: it is a parse failure on customer input, so it logs at `warn` ahead of the unknown-error `logger.error`. Concrete refactor:

  ```ts
  } catch (error) {
    if (error instanceof ServiceValidationError) {
      logger.warn("Stream batch items error", { batchId, error: error.message });
      return json({ error: error.message }, { status: 422 });
    }

    if (error instanceof Error && error.message.includes("Invalid JSON")) {
      // Customer-supplied stream isn't valid JSON; surface as 400.
      logger.warn("Stream batch items error: invalid JSON", { batchId, error: error.message });
      return json({ error: error.message }, { status: 400 });
    }

    logger.error("Stream batch items error", {
      batchId,
      error: { message: (error as Error).message, stack: (error as Error).stack },
    });

    return json({ error: (error as Error).message ?? "Something went wrong" }, { status: 500 });
  }
  ```
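
For sites 3, 4, and 5, the reshaped catch looks like the sketch below. Everything beyond what the table states (message wording, exact response body, whether both expected types expose `.message`) is an assumption, not lifted from the routes:

```ts
} catch (error) {
  // Expected customer-facing failures: validation or entitlement. Both types are
  // assumed to extend Error and carry a customer-readable message.
  if (error instanceof ServiceValidationError || error instanceof OutOfEntitlementError) {
    logger.warn("Batch trigger error", { error: error.message });
    return json({ error: error.message }, { status: 422 });
  }

  logger.error("Batch trigger error", { error });

  // Unknown failure: 500, and tell the SDK not to retry the batch as-is.
  return json(
    { error: "Something went wrong" },
    { status: 500, headers: { "x-should-retry": "false" } }
  );
}
```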

### Variant 2 — services that wrap into SVE (sites 6, 7)

Split the SVE branch out before the `logger.error`. Keep `logger.error` for the wrap-real-bug path so visibility survives the SDK filter — same reasoning as `dac9c83bd`'s `waitpointCompletionPacket.server.ts` change.

Before (site 6, schedules branch):

```ts
if (schedulesError) {
  logger.error("Error syncing declarative schedules", {
    error: schedulesError,
    backgroundWorker,
    environment,
  });

  if (schedulesError instanceof ServiceValidationError) {
    throw schedulesError;
  }

  throw new ServiceValidationError("Error syncing declarative schedules");
}
```

After:

```ts
if (schedulesError) {
  if (schedulesError instanceof ServiceValidationError) {
    // Customer schedule config (typically invalid cron). Surface to client
    // via the rethrow; system returns gracefully.
    logger.warn("Error syncing declarative schedules", {
      error: schedulesError.message,
      backgroundWorker,
      environment,
    });
    throw schedulesError;
  }

  // Wrapping the underlying error into a ServiceValidationError below would
  // otherwise hide it once the SDK-level filter drops SVEs; log at error so
  // the underlying cause stays visible. Mirrors the
  // waitpointCompletionPacket.server.ts pattern from dac9c83bd.
  logger.error("Error syncing declarative schedules", {
    error: schedulesError,
    backgroundWorker,
    environment,
  });

  throw new ServiceValidationError("Error syncing declarative schedules");
}
```

Site 7 has the identical shape. Both files also have earlier `filesError` and `resourcesError` catches that wrap into `ServiceValidationError` unconditionally (no `instanceof` check). Those are already correct under this design — they always log the inner error at `error` before wrapping a real bug into an SVE — so no change is needed there.
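
For reference, the existing unconditional-wrap shape in those earlier catches looks roughly like the sketch below (variable names and message text are assumed, not lifted from the files); the inner error is already logged at `error` before the wrap, which is exactly what this design requires:

```ts
if (filesError) {
  // Real bug: log the inner error at error *before* wrapping, so the cause stays
  // visible even though the SDK filter drops the wrapped ServiceValidationError.
  logger.error("Error creating background worker files", {
    error: filesError,
    environment,
  });

  throw new ServiceValidationError("Error creating background worker files");
}
```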

### Variant 3 — site 9, different error type, same fix

`api.v1.query.ts` catches `QueryError` (a customer SQL error). The error type isn't covered by the SDK filter, but that doesn't matter — `logger.warn` doesn't reach Sentry regardless. Same downgrade pattern.

Before:

```ts
if (!queryResult.success) {
  const message =
    queryResult.error instanceof QueryError
      ? queryResult.error.message
      : "An unexpected error occurred while executing the query.";

  logger.error("Query API error", { error: queryResult.error, query });

  return json(
    { error: message },
    { status: queryResult.error instanceof QueryError ? 400 : 500 }
  );
}
```

After:

```ts
if (!queryResult.success) {
  if (queryResult.error instanceof QueryError) {
    // Customer SQL is invalid or unsupported. Returned to caller as 400.
    logger.warn("Query API error", { error: queryResult.error.message, query });
    return json({ error: queryResult.error.message }, { status: 400 });
  }

  logger.error("Query API error", { error: queryResult.error, query });
  return json(
    { error: "An unexpected error occurred while executing the query." },
    { status: 500 }
  );
}
```

### Comment style

Three principles, taken from `dac9c83bd`:

1. **Lead with what the failure means in business terms** — *"Customer-facing validation failures …"*, not *"This is a ServiceValidationError"*.
2. **State why the system is fine** — *"Handler returns 4xx; system handles it gracefully, no alert needed."*
3. **Where wrapping happens, name the visibility risk explicitly** — see Variant 2's second comment, mirroring the `waitpointCompletionPacket.server.ts` callout from the original PR.

Comments are 1–3 lines, indicative voice, no headers.

## Tests

Vitest, flat under `apps/webapp/test/` matching the existing webapp convention (`vitest.config.ts` only picks up `test/**/*.test.ts`, so co-located tests would be silently skipped). Each file gets ~3 tests.
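
For reference, a minimal sketch of the include pattern that drives this; the shape of `apps/webapp/vitest.config.ts` is assumed, and only the glob is the point here:

```ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Rooted at test/, so co-located app/**/*.test.ts files are never picked up.
    include: ["test/**/*.test.ts"],
  },
});
```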

Test shape:

```ts
import { describe, it, expect, vi, beforeEach } from "vitest";

vi.mock("~/services/logger.server", () => ({
  logger: { warn: vi.fn(), error: vi.fn(), info: vi.fn(), debug: vi.fn() },
}));

// Mock the underlying service so we can throw whatever we want.

describe("api.v1.projects.$projectRef.background-workers error handling", () => {
  beforeEach(() => vi.clearAllMocks());

  it("logs ServiceValidationError at warn and returns 400", async () => {
    // Arrange the service to throw new ServiceValidationError("bad cron").
    // Invoke the action.
    // Assert: logger.warn called once with the wrapper message.
    // logger.error not called.
    // response.status === 400, body.error === "bad cron".
  });

  it("logs CreateDeclarativeScheduleError at warn and returns 400", async () => { /* ... */ });

  it("logs unknown errors at error and returns 500", async () => { /* ... */ });
});
```

What we explicitly do NOT test:

- The Sentry SDK integration. The premise *"`logger.warn` does not reach Sentry"* is verified once at the SDK level; tests assert on the logger contract only.
- The full happy path / business logic of each route. Out of scope; would balloon test scope without adding signal on this change.

Per-site test counts:

- Sites 1, 2 (route catches with two expected types): 3 tests each.
- Sites 3, 4, 5, 8 (route catches with two expected types or extra `Invalid JSON` branch): 3 tests each.
- Site 9 (route catch with one expected type): 2 tests.
- Sites 6, 7 (services with wrap-into-SVE on `schedulesError`): 2 tests each — SVE rethrow with `warn`, non-SVE wrap with `error` (sketched below).

Approximate total: ~24 tests across 9 files.
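
A sketch of the service-level pair for sites 6 and 7, in the same comment-stub style as the route test above (describe name and mocked module are assumptions):

```ts
import { describe, it, expect, vi, beforeEach } from "vitest";

vi.mock("~/services/logger.server", () => ({
  logger: { warn: vi.fn(), error: vi.fn(), info: vi.fn(), debug: vi.fn() },
}));

// Mock the declarative-schedule sync so it can throw on demand.

describe("createBackgroundWorker schedule sync error handling", () => {
  beforeEach(() => vi.clearAllMocks());

  it("rethrows ServiceValidationError and logs at warn", async () => {
    // Arrange the sync to throw new ServiceValidationError("invalid cron").
    // Assert: the same error propagates, logger.warn called once, logger.error not called.
  });

  it("wraps unknown errors into ServiceValidationError and logs at error", async () => {
    // Arrange the sync to throw new Error("db down").
    // Assert: thrown error is a ServiceValidationError with the wrapper message,
    // logger.error called once with the inner error, logger.warn not called.
  });
});
```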

Authentication scaffolding for action invocations can be a thin shared helper inside a `__test_helpers__` module if duplication emerges; the implementation plan decides this based on what the first 1–2 files need. Not pinned by the spec.

## Rollout

Single PR. Test plan in the PR description mirrors `dac9c83bd`'s structure:

- [ ] `pnpm run typecheck --filter webapp`
- [ ] `pnpm run test --filter webapp` covering the new `*.test.ts` files
- [ ] Manual smoke on `references/hello-world`: trigger a task with a deliberately invalid cron schedule, verify 4xx response and `warn` log; no Sentry event. Repeat with valid config to confirm happy path is unaffected.
- [ ] Add `.server-changes/sentry-wrapper-bypass-fix.md` per the project rule (`server-apps.md`). High-level summary, no design detail.

## Risk

The change is purely log-level + ordering; behaviour-preserving for users (same status codes, same response bodies). Failure modes:

- **Over-quieting** — a previously `error`-logged real bug accidentally demoted to `warn`. Mitigated by tests on the unknown-error branch at every site.
- **Under-quieting** — a site missed by the audit continues to escalate. Mitigated by the post-deploy verification metric below.

## Verification

Defined success metric, included in the PR description:

48 hours after merge, Sentry event volume on the wrapper-bypass groups should drop to near-zero in the deployed release. Specifically:

- [`TRIGGER-CLOUD-2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) — was 13k lifetime, ~158 in last 48h.
- [`TRIGGER-CLOUD-38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) — 755 in last 48h.
- [`TRIGGER-CLOUD-103`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-103) — 1,052 in last 48h.

Any of those three still firing in the new release indicates a missed call site or a different error type slipping through; investigate and patch as a follow-up.

## Known follow-ups

Captured for separate brainstorms:

1. **Wrap-into-SVE pattern more broadly.** Services that wrap unrelated errors into `ServiceValidationError` create a permanent visibility risk: even with this fix, real bugs wrapped this way only stay visible because we explicitly log at error before the wrap. Replacing the wrap with a distinct `InternalServiceError` (or simply not wrapping) would let the SDK filter and route-level handlers treat the two cases correctly without per-site care.
2. **`api.v1.tasks.$taskId.trigger.ts` silent 500s.** The single-task trigger endpoint catches and returns 500 with `error.message` but never calls `logger.error`. Real bugs on the highest-volume endpoint in the system are not surfacing in Sentry. Inverse problem to this design.

## References

- `dac9c83bd` — chore(webapp,run-engine): downgrade boundary log noise to warn (#3462). Source for style, voice, and the `waitpointCompletionPacket.server.ts` "log inner before wrap" pattern reused here.
- `apps/webapp/sentry.server.ts` — current Sentry SDK config including `ignoreErrors` filter.
- Investigation Notion doc — *Sentry Triage May 2026 / Investigation: Prisma P1001 RDS Connectivity* (parent of this design's reasoning thread).