# Sentry wrapper-bypass log-level fix
**Date:** 2026-05-01
**Status:** Design — not committed; lives in `_plans/` for handoff to implementation.
**Related:** [`dac9c83bd`](https://github.com/triggerdotdev/trigger.dev/commit/dac9c83bd) (chore: downgrade boundary log noise to warn)
## Problem
`dac9c83bd` added a Sentry SDK-level filter — `ignoreErrors: /^ServiceValidationError(?::|$)/` in `apps/webapp/sentry.server.ts` — that drops `ServiceValidationError` (SVE) events before they reach Sentry. The filter only matches when the captured event's *type* is `ServiceValidationError`, so it relies on the SVE reaching Sentry as the captured exception itself.
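
For context, a minimal sketch of the shape that filter takes, assuming the standard Remix SDK `init` call (the `dsn` wiring here is illustrative, not quoted from `sentry.server.ts`):

```ts
import * as Sentry from "@sentry/remix";

Sentry.init({
  dsn: process.env.SENTRY_DSN, // illustrative wiring
  // Matches against the captured exception's type/message. An SVE that only
  // appears inside a logger.error payload never reaches this check.
  ignoreErrors: [/^ServiceValidationError(?::|$)/],
});
```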
Nine call sites in the webapp catch `ServiceValidationError` (and analogous user-input error types like `OutOfEntitlementError`, `CreateDeclarativeScheduleError`, `QueryError`), then call `logger.error("wrapper message", { error: e })` *before* performing the type discrimination. That `logger.error` becomes a Sentry event titled with the wrapper message, with the inner error buried in `extra.error` — invisible to the SDK filter.
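
The event the bypass produces looks roughly like this (field values hypothetical); the filterable type sits under `extra.error`, out of `ignoreErrors`' reach:

```ts
// Sketch of the captured event's relevant fields, not real code from the app.
const capturedEvent = {
  message: "Failed to create background worker", // the wrapper string Sentry groups on
  level: "error",
  extra: {
    // The actual cause, invisible to the type-based SDK filter.
    error: { name: "ServiceValidationError", message: "Invalid cron expression" },
  },
};
```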
Visible impact in production today:
- [`TRIGGER-CLOUD-2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) "Failed to create background worker" — 13,013 lifetime occurrences, still firing in `19c16759f`.
- [`TRIGGER-CLOUD-38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) "Batch trigger error" — 755 events in last 48h, still firing.
- [`TRIGGER-CLOUD-103`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-103) "Stream batch items error" — 1,052 events in last 48h, still firing.

In each case the underlying cause is a `ServiceValidationError` that should never have escalated.
## Goals
- Stop the wrapper-bypass groups (`2S`, `38`, `103` and the smaller related issues) from escalating expected user-input failures to Sentry.
- Preserve `error`-level visibility for genuinely unknown errors at the same call sites.
- Match the style and voice of `dac9c83bd` so the change reads as a continuation of that PR's work.
## Non-goals
- The "wrap unrelated errors into `ServiceValidationError`" pattern in the two service files. We address its log-level symptom here (log inner at `error` before wrapping) but do not redesign the wrapping itself. Captured as a separate brainstorm.
- The single-task trigger endpoint's silent 500s (`api.v1.tasks.$taskId.trigger.ts`). Inverse problem — real bugs are being swallowed. Out of scope; separate brainstorm.
- Extending the Sentry SDK `ignoreErrors` filter. The downgrade to `logger.warn` is sufficient — `warn` does not reach Sentry regardless of the inner error type (see the sketch after this list).
- Refactoring catch blocks beyond what's strictly needed for the log-level change. Side cleanups are limited to converting `JSON.stringify(e)` to `error: e` where it appears (matching `dac9c83bd`'s `error: e` style and the v2 file's existing form).
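
The premise that `warn` never reaches Sentry assumes a level gate in the webapp logger, roughly of this shape (hypothetical; the real `logger.server` implementation may differ):

```ts
import * as Sentry from "@sentry/remix";

type Level = "debug" | "info" | "warn" | "error";

// Hypothetical contract: everything goes to stdout, but only error-level
// records are forwarded to Sentry, so warn stays out of Sentry no matter
// which error type sits in the payload.
function log(level: Level, message: string, payload?: Record<string, unknown>) {
  console[level](message, payload);
  if (level === "error") {
    Sentry.captureMessage(message, { level: "error", extra: payload });
  }
}
```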
## Affected call sites
Nine sites total, confirmed by audit (a broader grep covering all `XxxError` types, plus a directory sweep across `apps/webapp/app`, `apps/supervisor`, and `internal-packages`).

| # | File:Line | Wrapper message | Expected types caught | Unknown-error fallback | Maps to Sentry issue |
|---|---|---|---|---|---|
| 1 | `apps/webapp/app/routes/api.v1.projects.$projectRef.background-workers.ts:60` | "Failed to create background worker" | `ServiceValidationError`, `CreateDeclarativeScheduleError` (both → 400) | 500 | [`2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) |
| 2 | `apps/webapp/app/routes/api.v1.deployments.$deploymentId.background-workers.ts:62` | "Failed to create background worker" | same | 500 | [`2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) |
| 3 | `apps/webapp/app/routes/api.v1.tasks.batch.ts:130` | "Batch trigger error" | `ServiceValidationError`, `OutOfEntitlementError` (both → 422) | 500 with `x-should-retry: false` | [`38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) |
| 4 | `apps/webapp/app/routes/api.v2.tasks.batch.ts:147` | "Batch trigger error" | same | 500 with `x-should-retry: false` | [`38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) |
| 5 | `apps/webapp/app/routes/api.v3.batches.ts:175` | "Create batch error" | `ServiceValidationError`, `OutOfEntitlementError` (both → 422) | 500 with `x-should-retry: false` | not currently in Sentry |
| 6 | `apps/webapp/app/v3/services/createBackgroundWorker.server.ts:149` | "Error syncing declarative schedules" | `ServiceValidationError` (rethrown) | wraps unknown into new `ServiceValidationError`, throws | not currently in Sentry |
| 7 | `apps/webapp/app/v3/services/createDeploymentBackgroundWorkerV4.server.ts:142` | "Error syncing declarative schedules" | same | wraps unknown into new `ServiceValidationError`, throws | not currently in Sentry |
| 8 | `apps/webapp/app/routes/api.v3.batches.$batchId.items.ts:91` | "Stream batch items error" | `ServiceValidationError` → 422; `Error` with `Invalid JSON` → 400 | 500 | [`103`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-103) |
| 9 | `apps/webapp/app/routes/api.v1.query.ts:69` | "Query API error" | `QueryError` → 400 | 500 | not currently in Sentry |
## Approach
Inline at each site; no shared helper. Three variants distinguished by call-site shape.
### Variant 1 — route catches that return (sites 1, 2, 3, 4, 5, 8)
Move `logger.error` below the `instanceof` chain so that only the unknown-error fall-through invokes it, and log the expected branches at `warn` with the same wrapper message. Convert `else if` chains to early returns for readability. Add a short comment in the `dac9c83bd` voice on the catch block.

Before (site 1):
```ts
} catch (e) {
  logger.error("Failed to create background worker", { error: JSON.stringify(e) });

  if (e instanceof ServiceValidationError) {
    return json({ error: e.message }, { status: 400 });
  } else if (e instanceof CreateDeclarativeScheduleError) {
    return json({ error: e.message }, { status: 400 });
  }

  return json({ error: "Failed to create background worker" }, { status: 500 });
}
```
After:
```ts
} catch (e) {
  // Customer-facing validation failures (invalid task config, customer cron
  // expression, etc.). The handler returns 4xx with the message; system
  // handles it gracefully, no alert needed.
  if (e instanceof ServiceValidationError) {
    logger.warn("Failed to create background worker", { error: e.message });
    return json({ error: e.message }, { status: 400 });
  }
  if (e instanceof CreateDeclarativeScheduleError) {
    logger.warn("Failed to create background worker", { error: e.message });
    return json({ error: e.message }, { status: 400 });
  }

  logger.error("Failed to create background worker", { error: e });

  return json({ error: "Failed to create background worker" }, { status: 500 });
}
```
Per-site differences:
- Sites 3, 4, 5: type list is `[ServiceValidationError, OutOfEntitlementError]`, status `422`, and the response includes the `x-should-retry: false` header on the unknown path; see the sketch after the site-8 refactor below.
- Site 8 keeps the `Invalid JSON` substring branch (returns 400) but treats it as another expected-input case: it logs at `warn` alongside the `ServiceValidationError` branch, ahead of the unknown-error `logger.error` and the generic 500, since that branch represents a parse failure on customer input. Concrete refactor:
```ts
} catch (error) {
  if (error instanceof ServiceValidationError) {
    logger.warn("Stream batch items error", { batchId, error: error.message });
    return json({ error: error.message }, { status: 422 });
  }

  if (error instanceof Error && error.message.includes("Invalid JSON")) {
    // Customer-supplied stream isn't valid JSON; surface as 400.
    logger.warn("Stream batch items error: invalid JSON", { batchId, error: error.message });
    return json({ error: error.message }, { status: 400 });
  }

  logger.error("Stream batch items error", {
    batchId,
    error: { message: (error as Error).message, stack: (error as Error).stack },
  });

  return json({ error: (error as Error).message ?? "Something went wrong" }, { status: 500 });
}
```
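
For completeness, the sites 3–5 shape referenced in the first bullet would plausibly look like the following (wrapper message from the table; the surrounding handler code is assumed):

```ts
} catch (e) {
  // Expected user-input failures on the batch endpoints: downgrade to warn,
  // keep today's 422.
  if (e instanceof ServiceValidationError || e instanceof OutOfEntitlementError) {
    logger.warn("Batch trigger error", { error: e.message });
    return json({ error: e.message }, { status: 422 });
  }

  logger.error("Batch trigger error", { error: e });

  // Unknown path keeps the 500 and the x-should-retry: false header.
  return json(
    { error: "Batch trigger error" },
    { status: 500, headers: { "x-should-retry": "false" } }
  );
}
```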
### Variant 2 — services that wrap into SVE (sites 6, 7)
Split the SVE branch out before the `logger.error`. Keep `logger.error` for the wrap-real-bug path so visibility survives the SDK filter — same reasoning as `dac9c83bd`'s `waitpointCompletionPacket.server.ts` change.

Before (site 6, schedules branch):
```ts
if (schedulesError) {
  logger.error("Error syncing declarative schedules", {
    error: schedulesError,
    backgroundWorker,
    environment,
  });

  if (schedulesError instanceof ServiceValidationError) {
    throw schedulesError;
  }

  throw new ServiceValidationError("Error syncing declarative schedules");
}
```
142+
143+
After:
144+
145+
```ts
if (schedulesError) {
  if (schedulesError instanceof ServiceValidationError) {
    // Customer schedule config (typically invalid cron). Surface to client
    // via the rethrow; system returns gracefully.
    logger.warn("Error syncing declarative schedules", {
      error: schedulesError.message,
      backgroundWorker,
      environment,
    });
    throw schedulesError;
  }

  // Wrapping the underlying error into a ServiceValidationError below would
  // otherwise hide it once the SDK-level filter drops SVEs; log at error so
  // the underlying cause stays visible. Mirrors the
  // waitpointCompletionPacket.server.ts pattern from dac9c83bd.
  logger.error("Error syncing declarative schedules", {
    error: schedulesError,
    backgroundWorker,
    environment,
  });

  throw new ServiceValidationError("Error syncing declarative schedules");
}
```
Site 7 has the identical shape. Both files also have earlier `filesError` and `resourcesError` catches that wrap into `ServiceValidationError` unconditionally (no `instanceof` check). Those are already correct under this design — they always log the inner error at `error` before wrapping a real bug into an SVE — so no change is needed there.
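
For reference, those earlier catches are plausibly of this shape (paraphrased; the message string is hypothetical). The inner error is already logged at `error` before the wrap, which is exactly the invariant this design enforces elsewhere:

```ts
// Paraphrased shape of the existing filesError handling; unchanged by this
// design because the real cause is logged at error before the wrap.
if (filesError) {
  logger.error("Error creating background worker files", { error: filesError });
  throw new ServiceValidationError("Error creating background worker files");
}
```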
### Variant 3 — site 9, different error type, same fix
`api.v1.query.ts` catches `QueryError` (a customer SQL error). The error type isn't covered by the SDK filter, but that doesn't matter — `logger.warn` doesn't reach Sentry regardless. Same downgrade pattern.

Before:
```ts
if (!queryResult.success) {
  const message =
    queryResult.error instanceof QueryError
      ? queryResult.error.message
      : "An unexpected error occurred while executing the query.";

  logger.error("Query API error", { error: queryResult.error, query });

  return json(
    { error: message },
    { status: queryResult.error instanceof QueryError ? 400 : 500 }
  );
}
```
After:
```ts
if (!queryResult.success) {
  if (queryResult.error instanceof QueryError) {
    // Customer SQL is invalid or unsupported. Returned to caller as 400.
    logger.warn("Query API error", { error: queryResult.error.message, query });
    return json({ error: queryResult.error.message }, { status: 400 });
  }

  logger.error("Query API error", { error: queryResult.error, query });
  return json(
    { error: "An unexpected error occurred while executing the query." },
    { status: 500 }
  );
}
```
### Comment style
Three principles, taken from `dac9c83bd`:
1. **Lead with what the failure means in business terms:** *"Customer-facing validation failures …"*, not *"This is a ServiceValidationError"*.
2. **State why the system is fine:** *"Handler returns 4xx; system handles it gracefully, no alert needed."*
3. **Where wrapping happens, name the visibility risk explicitly** — see Variant 2's second comment, mirroring the `waitpointCompletionPacket.server.ts` callout from the original PR.

Comments are 1–3 lines, indicative voice, no headers.
## Tests
Vitest, flat under `apps/webapp/test/` matching the existing webapp convention (`vitest.config.ts` only picks up `test/**/*.test.ts`, so co-located tests would be silently skipped). Each file gets ~3 tests.
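
The include glob that motivates the flat layout, as an assumed excerpt of `apps/webapp/vitest.config.ts` (paraphrased, not quoted):

```ts
import { defineConfig } from "vitest/config";

// Anything outside test/**/*.test.ts, e.g. a co-located
// app/routes/foo.test.ts, is never picked up by the runner.
export default defineConfig({
  test: {
    include: ["test/**/*.test.ts"],
  },
});
```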
Test shape:
```ts
import { describe, it, expect, vi, beforeEach } from "vitest";

vi.mock("~/services/logger.server", () => ({
  logger: { warn: vi.fn(), error: vi.fn(), info: vi.fn(), debug: vi.fn() },
}));

// Mock the underlying service so we can throw whatever we want.

describe("api.v1.projects.$projectRef.background-workers error handling", () => {
  beforeEach(() => vi.clearAllMocks());

  it("logs ServiceValidationError at warn and returns 400", async () => {
    // Arrange the service to throw new ServiceValidationError("bad cron").
    // Invoke the action.
    // Assert: logger.warn called once with the wrapper message.
    // logger.error not called.
    // response.status === 400, body.error === "bad cron".
  });

  it("logs CreateDeclarativeScheduleError at warn and returns 400", async () => { /* ... */ });

  it("logs unknown errors at error and returns 500", async () => { /* ... */ });
});
```
What we explicitly do NOT test:
- The Sentry SDK integration. The premise *"`logger.warn` does not reach Sentry"* is verified once at the SDK level; tests assert on the logger contract only.
- The full happy path / business logic of each route. Out of scope; would balloon test scope without adding signal on this change.

Per-site test counts:
- Sites 1, 2 (route catches with two expected types): 3 tests each.
- Sites 3, 4, 5, 8 (route catches with two expected types or extra `Invalid JSON` branch): 3 tests each.
- Site 9 (route catch with one expected type): 2 tests.
- Sites 6, 7 (services with wrap-into-SVE on `schedulesError`): 2 tests each — SVE rethrow with `warn`, non-SVE wrap with `error`.

Total: ~24 tests across 9 files.

Authentication scaffolding for action invocations can be a thin shared helper inside a `__test_helpers__` module if duplication emerges; the implementation plan decides this based on what the first 1–2 files need. Not pinned by the spec.
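
If that helper does emerge, one plausible shape (entirely hypothetical; the module name, function name, and fake-token convention are illustrative):

```ts
// apps/webapp/test/__test_helpers__/request.ts (hypothetical): builds the
// Request a route action expects, with auth satisfied by a fake bearer token
// because the auth layer is mocked in these tests.
export function authenticatedJsonRequest(url: string, body: unknown): Request {
  return new Request(url, {
    method: "POST",
    headers: {
      Authorization: "Bearer tr_test_000", // fake token; auth is mocked
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
}
```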
## Rollout
Single PR. Test plan in the PR description mirrors `dac9c83bd`'s structure:
- [ ] `pnpm run typecheck --filter webapp`
- [ ] `pnpm run test --filter webapp` covering the new `*.test.ts` files
- [ ] Manual smoke on `references/hello-world`: trigger a task with a deliberately invalid cron schedule, verify 4xx response and `warn` log; no Sentry event. Repeat with valid config to confirm happy path is unaffected.
- [ ] Add `.server-changes/sentry-wrapper-bypass-fix.md` per the project rule (`server-apps.md`). High-level summary, no design detail.
## Risk
The change is purely log-level + ordering; behaviour-preserving for users (same status codes, same response bodies). Failure modes:
- **Over-quieting** — a previously `error`-logged real bug accidentally demoted to `warn`. Mitigated by tests on the unknown-error branch at every site.
- **Under-quieting** — a site missed by the audit continues to escalate. Mitigated by the post-deploy verification metric below.
## Verification
Defined success metric, included in the PR description:

48 hours after merge, Sentry event volume on the wrapper-bypass groups should drop to near-zero in the deployed release. Specifically:
- [`TRIGGER-CLOUD-2S`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-2S) — was 13k lifetime, ~158 in last 48h.
- [`TRIGGER-CLOUD-38`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-38) — 755 in last 48h.
- [`TRIGGER-CLOUD-103`](https://triggerdev.sentry.io/issues/TRIGGER-CLOUD-103) — 1,052 in last 48h.

Any of those three still firing in the new release indicates a missed call site or a different error type slipping through; investigate and patch as a follow-up.
## Known follow-ups
Captured for separate brainstorms:
1. **Wrap-into-SVE pattern more broadly.** Services that wrap unrelated errors into `ServiceValidationError` create a permanent visibility risk: even with this fix, real bugs wrapped this way only stay visible because we explicitly log at error before the wrap. Replacing the wrap with a distinct `InternalServiceError` (or simply not wrapping) would let the SDK filter and route-level handlers treat the two cases correctly without per-site care.
2. **`api.v1.tasks.$taskId.trigger.ts` silent 500s.** The single-task trigger endpoint catches and returns 500 with `error.message` but never calls `logger.error`. Real bugs on the highest-volume endpoint in the system are not surfacing in Sentry. Inverse problem to this design.
## References
- `dac9c83bd` — chore(webapp,run-engine): downgrade boundary log noise to warn (#3462). Source for style, voice, and the `waitpointCompletionPacket.server.ts` "log inner before wrap" pattern reused here.
- `apps/webapp/sentry.server.ts` — current Sentry SDK config including `ignoreErrors` filter.
- Investigation Notion doc — *Sentry Triage May 2026 / Investigation: Prisma P1001 RDS Connectivity* (parent of this design's reasoning thread).

0 commit comments

Comments
 (0)